Specified Life: De-identifying a public domain book with the doublet method

Friday, January 18, 2008

In the last few posts, I've been discussing the doublet method for scrubbing medical records. The doublet method de-identifier will accept any text file. To demonstrate its versatility, and to provide a basis for comparison with other de-identifiers, I downloaded a public domain book from Project Gutenberg, de-identified the entire book, and posted the output at the following URL:

http://www.julesberman.info/aacom10.htm

Project Gutenberg is a remarkable resource that publishes plain-text versions of literary gems that have passed out of copyright. I used Anomalies and Curiosities of Medicine by George M. Gould and Walter Lytle Pyle. This book has lots of medical terminology and vaguely resembles the kind of text that might be included in a pathology report. Anyone can download the same text from:

http://www.gutenberg.org/etext/747

A public domain list of doublets, doublets.txt, used in the script, is available for download, but I cannot guarantee that the list is identifier-free or that it is the best list for your purposes. Feel free to modify the list, add to the list, or create your own list of identifier-free doublets. In the script, "aacom10.txt" is the Project Gutenberg file for Anomalies and Curiosities of Medicine.
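
If you want to create your own list, one possible starting point is to harvest every doublet from a text that you already trust to be free of identifiers. The following is only a sketch under that assumption; the subroutine name doublets_from_text and the sample sentence are hypothetical and are not part of the scrubbing script. Words are normalized the same way the script normalizes them (lowercased, keeping only letters, apostrophes and hyphens).

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Return a sorted list of the unique doublets (pairs of adjacent words)
# found in a text. Hypothetical helper, for illustration only.
sub doublets_from_text {
    my ($text) = @_;
    my @words;
    foreach my $word (split /\s+/, $text) {
        $word = lc $word;
        $word =~ tr/a-z'\-//cd;      # keep letters, apostrophes, hyphens
        push @words, $word if $word ne "";
    }
    my %seen;
    for my $i (0 .. $#words - 1) {
        $seen{"$words[$i] $words[$i+1]"} = 1;
    }
    return sort keys %seen;
}

# Hypothetical sample; in practice, read a text known to contain no identifiers.
my $sample = "The placenta shows a real affinity. The placenta stores glycogen.";
print "$_\n" foreach doublets_from_text($sample);
```

The printed lines could be appended to doublets.txt, one doublet per line. The result is only as safe as the source text, so a text that contains names will produce doublets that contain names.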

An example output paragraph is shown below. The doublet method keeps a word only if it belongs to at least one approved doublet (a pair of adjacent words); every other word is replaced by an asterisk. As expected with the doublet method, there are many blocked words. This is a limitation of the doublet method. If you use the standard list of doublets on any random book, you're bound to block some innocent doublets that weren't included in the "approved" list. The only way to get around this limitation is to add safe doublets (from the text) to the "approved" list.

In this important *, *, * * some historical *, describes a long series of experiments performed on * in order to * the passage of *, *, *, *, *, *, * * the placenta. The placenta shows a real affinity for * substances; in it * copper and mercury, but *, and it is therefore * it that the * * *; in addition to its *, intestinal, and *, * * glycogen and acts as an * *, and so resembles in its action the liver; * * of the fetus * only a potential *. * up of * in the placenta is not so general as * of them in the liver of the mother. It may be * the placenta does not form a barrier to the passage of * the circulation of the fetus; this would seem to * * *, which was always found in the * never in the fetal organs. In * * lead and * accumulation of the * in the fetal tissues is * in the maternal, perhaps from differences in * * or from greater diffusion. * it is * * barrier to the passage of *, * * * * degree of obstruction: it allows copper and * * *, * with greater difficulty. The * toxic substances in the fetus does not follow the same * * the adult. They * more widely in the fetus. In the * liver is the chief * *. *, which in * * to accumulate in the liver, is in the fetus * in the skin; copper accumulates in the fetal liver, * system, and sometimes in the skin; * which is * in the maternal liver, but also in the skin, has * in the skin, liver, * centers, and elsewhere * *. The frequent presence of * in the fetal * its physiologic importance. It has probably not * * influence on its *. On the * in the placenta and nerve * * * * abortion and the birth of dead *) Copper and lead did not cause *, * * so in two out of six *. Arsenic is a * agent in the *, * * * * *. An important * is that * * is frequently and seriously affected in syphilis, * * the special * for the accumulation of *. * * * * * action in this disease? The * of lead in the central nervous system of the * the frequency and serious character of * lesions. 
The presence of * in the * * * an explanation of the therapeutic results of * of this substance in skin *.
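
One way to find candidates for the "approved" list is to count the doublets in the text that the list does not yet contain, and then review the most frequent ones by hand. The following is a minimal sketch of that idea, not part of the scrubbing script; the subroutine name unapproved_doublets, the tiny approved list and the sample text are all hypothetical.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Count the doublets in a text that are absent from the approved list.
# Returns a hash reference: doublet => number of occurrences.
sub unapproved_doublets {
    my ($text, $approved) = @_;
    my (%count, $previous);
    foreach my $word (split /\s+/, $text) {
        $word = lc $word;
        $word =~ tr/a-z'\-//cd;      # same normalization as the scrubber
        next if $word eq "";
        if (defined $previous) {
            my $doublet = "$previous $word";
            $count{$doublet}++ unless exists $approved->{$doublet};
        }
        $previous = $word;
    }
    return \%count;
}

# Hypothetical approved list and sample text, for illustration only.
my %approved = map { $_ => "" } ("the placenta", "of the");
my $text = "Accumulation in the fetal liver of the mother; the fetal liver stores copper.";
my $candidates = unapproved_doublets($text, \%approved);

# Most frequent candidates first; ties are broken alphabetically.
foreach my $doublet (sort { $candidates->{$b} <=> $candidates->{$a} || $a cmp $b }
                     keys %$candidates) {
    print "$candidates->{$doublet}\t$doublet\n";
}
```

A frequent doublet is not necessarily a safe one, so each candidate should be checked by a person before it is added to the list.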

The strength of the doublet method is speed (the 2.4-megabyte book was de-identified in 3 seconds, much faster than other de-identifiers described in the literature). Also, in my experience, the doublet method is virtually perfect at removing identifiers. I have never encountered a missed identifier in text scrubbed by the doublet method. If you find any identifiers in the de-identified book, please let me know. Finally, the doublet method is simple. The Perl script that I used to scrub the book is shown below, in its entirety.

As with all my distributed scripts, the following disclaimer applies:

The Perl script for de-identifying text using the doublet method is provided by its creator, Jules J. Berman, "as is", without warranty of any kind, expressed or implied, including but not limited to the warranties of merchantability, fitness for a particular purpose and noninfringement. In no event shall the author or copyright holder be liable for any claim, damages or other liability, whether in an action of contract, tort or otherwise, arising from, out of or in connection with the software or the use or other dealings in the software.


#!/usr/local/bin/perl
# Load the approved doublets, one per line, into a hash for fast lookup.
$begin = time();
open(TEXT,"doublets.txt")||die"cannot open doublets.txt: $!";
while ($getline = <TEXT>)
  {
  chomp($getline);
  next if ($getline eq "");
  $doublethash{$getline}= "";
  }
$end = time();
$totaltime = $end - $begin;
print STDERR "Time to create ";
print STDERR "the doublet hash is ";
print STDERR "$totaltime seconds.\n\n";
close TEXT;
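$begin = time();
# Read the book one paragraph at a time; paragraphs are separated by blank lines.
$/ = "\n\n";
open(TEXT,"aacom10.txt")||die"cannot open aacom10.txt: $!";
open(STDOUT,">aacom10.out")||die"cannot open aacom10.out: $!";
$oldthing = ""; $state = 0;
while ($line = <TEXT>)
   {
   next if ($line eq "\n");
   #print "Original - $line" . "Scrubbed - " ;
   $line =~ s/\n$//;
   # Join the lines of the paragraph. The /g flag replaces every line break;
   # without it only the first is replaced, and words on either side of the
   # remaining line breaks are fused and blocked.
   $line =~ s/\n/ /g;
   my @linearray = split(/ +/,$line);
   # Append a sentinel that marks the end of the paragraph.
   push (@linearray, "lastword");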
   foreach $thing (@linearray)
     {
     $originalthing = $thing;
     $thing = lc($thing);
     $thing =~ tr/a-z\'\-//cd;
     if ($oldthing eq "")
        {
        $oldthing = $thing;
        $originaloldthing = $originalthing;
        next;
        }
     $term = "$oldthing $thing";
     if (exists($doublethash{$term}))
        {
        print "$originaloldthing ";
        $oldthing = $thing;
        $originaloldthing = $originalthing;
        $state = 1;
        next;
        }
     if ($state == 1)
        {
        if ($thing eq "lastword")
          {
          print $originaloldthing;
          print "\n\n";
          $oldthing = "";
          $state = 0;
          next;
          }
        print "$originaloldthing ";
        $oldthing = $thing;
        $originaloldthing = $originalthing;
        $state = 0;
        next;
        }
     if ($state == 0)
        {
        if ($thing eq "lastword")
          {
          print "\*\.\n\n";
          $oldthing = "";
          next;
          }
        $punctuation = substr($originaloldthing,-1,1);
        if ($punctuation =~ /[a-zA-Z0-9]/)
           {
           $punctuation = "";
           }
        print "\*" . "$punctuation ";
        $oldthing = $thing;
        $originaloldthing = $originalthing;
        next;
        }
     }
   }
$end = time();
$totaltime = $end - $begin;
print STDERR "Time following ";
print STDERR "doublet hash creation";
print STDERR " is $totaltime seconds.\n";
exit;
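
For readers who want to experiment with the rule on a single sentence, here is a simplified restatement of the same idea as a subroutine. It is a sketch, not the script above: the name scrub and the three-doublet approved list are hypothetical, and unlike the script it does not carry over the trailing punctuation of blocked words or handle paragraph endings specially.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Replace every word that is not part of an approved doublet with "*".
# Hypothetical helper, for illustration only.
sub scrub {
    my ($text, $approved) = @_;
    my @original = split ' ', $text;
    # Normalize as the script does: lowercase, keep letters, ' and -.
    my @normal = map { my $w = lc $_; $w =~ tr/a-z'\-//cd; $w } @original;
    my @keep = (0) x @original;
    for my $i (0 .. $#normal - 1) {
        # An approved doublet protects both of its words.
        if (exists $approved->{"$normal[$i] $normal[$i+1]"}) {
            @keep[$i, $i + 1] = (1, 1);
        }
    }
    return join " ", map { $keep[$_] ? $original[$_] : "*" } 0 .. $#original;
}

# Hypothetical approved list, far smaller than a real doublets.txt.
my %approved = map { $_ => "" } ("the placenta", "placenta shows", "a real");
print scrub("Dr. Smith says the placenta shows a real affinity.", \%approved), "\n";
# prints: * * * the placenta shows a real *
```

Because a word survives only when one of its neighbors vouches for it through an approved doublet, a name that appears in no approved doublet is always blocked, whatever the surrounding text.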

- Jules Berman
