Specified Life: Perl implementation of doublet deidentifier

Tuesday, January 15, 2008

Perl implementation of doublet deidentifier

Here is the Perl code for implementing the doublet deidentifier (medical record scrubber).

It operates on a collection of over 15000 PubMed Citations (author line and title line), and uses a publicly available external list of "safe" doublets. A plain-text file of doublets is available.

The entire output of the script is available for review.

As with all my distributed scripts, the following disclaimer applies:

The perl script for deidentifying text using the doublet method is provided by its creator, Jules J. Berman, "as is", without warranty of any kind, expressed or implied, including but not limited to the warranties of merchantability, fitness for a particular purpose and noninfringement. in no event shall the author or copyright holder be liable for any claim, damages or other liability, whether in an action of contract, tort or otherwise, arising from, out of or in connection with the software or the use or other dealings in the software.


#!/usr/local/bin/perl
$begin = time();
open(TEXT,"doublets.txt")||die"cannot";
$line = " ";
while ($line ne "")
  {
  $line = $getline = <TEXT>;
  $getline =~ s/\n//;
  $doublethash{$getline}= "";
  }
$end = time();
$totaltime = $end - $begin;
print STDERR "Time following to create ";
print STDERR "the doublet hash is ";
print STDERR "$totaltime seconds.\n\n";
close TEXT;
$begin = time();
open(TEXT,"pathol5.txt")||die"cannot";
open(STDOUT,">pathol5.out")||die"cannot";
$line = " "; $oldthing = ""; $state = 0;
while ($line ne "")
   {
   $line = <TEXT>;
   next if ($line eq "\n");
   print "Original - $line" . "Scrubbed - " ;
   $line =~ s/[\,\.\n]//g;
   $line = lc($line);
   my @linearray = split(/ /,$line);
   push (@linearray, "lastword");
   foreach $thing (@linearray)
     {
     if ($oldthing eq "")
        {
        $oldthing = $thing;
        next;
        }
     $term = "$oldthing $thing";
     if (exists($doublethash{$term}))
        {
        print "$oldthing ";
        $oldthing = $thing;
        $state = 1;
        next;
        }
     if ($state == 1)
        {
        if ($thing eq "lastword")
          {
          print $oldthing;
          print "\.\n";
          $oldthing = "";
          $state = 0;
          next;
          }
        print "$oldthing ";
        $oldthing = $thing;
        $state = 0;
        next;
        }
     if ($state == 0)
        {
        if ($thing eq "lastword")
          {
          print "\*\.\n";
          $oldthing = "";
          next;
          }
        print "\* ";
        $oldthing = $thing;
        next;
        }
     }
   }
$end = time();
$totaltime = $end - $begin;
print STDERR "Time following ";
print STDERR "doublet hash creation";
print STDERR " is $totaltime seconds.";
exit;

-Jules Berman

My book, Principles of Big Data: Preparing, Sharing, and Analyzing Complex Information was published in 2013 by Morgan Kaufmann.

I urge you to explore my book. Google books has prepared a generous preview of the book contents.

Jules J. Berman, Ph.D., M.D.
tags: big data, metadata, data preparation, data analytics, data repurposing, datamining, data mining, de-identification, deidentification, doublet method, medical scrubber, Perl script