Thursday, January 17, 2008

Fast deidentifier that preserves punctuation

Earlier this week, I provided a very simple, fast, and almost perfect medical record de-identifier Perl script. The script uses a public domain list of about 200,000 word doublets (pairs of adjacent words) and keeps only the words that occur in one of those doublets.

A public domain file shows the output of this de-identifier for an input of about 15,000 PubMed medical citations. PubMed citations are an excellent way to test de-identifiers because they are not copyrighted, they contain lots of medical vocabulary, and they are full of identifiers (the names of the authors).

The provided output file does not preserve the punctuation in the original text.

It is easy to modify the Perl script to preserve case (lowercase, uppercase) and punctuation from the original text, and the output of the modified script is also available.
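The modification rests on keeping two versions of every token: a normalized copy used only for the doublet lookup, and the untouched original used for output. A minimal sketch of the normalization step, using the same character filter as the script in this post (the sub name normalize_token is hypothetical and does not appear in the script):

```perl
use strict;
use warnings;

# Hypothetical helper: lowercase a token and keep only letters,
# apostrophes, and hyphens, so it can serve as a doublet lookup key.
sub normalize_token {
    my ($token) = @_;
    my $key = lc $token;
    $key =~ tr/a-z'\-//cd;    # delete every character not in the list
    return $key;
}

# The original token is what gets printed; the key is only for lookup.
for my $original ("Carcinoma,", "(Berman", "non-small") {
    printf "%-12s -> %s\n", $original, normalize_token($original);
}
```

Because the lookup key and the printed token are separate, capital letters, commas, and parentheses never affect whether a doublet is found.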

As with all my distributed scripts, the following disclaimer applies:

The Perl script for deidentifying text using the doublet method is provided by its creator, Jules J. Berman, "as is", without warranty of any kind, expressed or implied, including but not limited to the warranties of merchantability, fitness for a particular purpose and noninfringement. In no event shall the author or copyright holder be liable for any claim, damages or other liability, whether in an action of contract, tort or otherwise, arising from, out of or in connection with the software or the use or other dealings in the software.

The revised Perl script is shown here:


#!/usr/local/bin/perl
$begin = time();
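open(TEXT,"doublets.txt")||die"cannot open doublets.txt: $!";
# Load each doublet (one per line) as a hash key for fast lookup.
# Reading in the loop condition stops cleanly at end of file, so no
# empty key is added to the hash.
while ($getline = <TEXT>)
{
chomp($getline);
next if ($getline eq "");
$doublethash{$getline}= "";
}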
$end = time();
$totaltime = $end - $begin;
print STDERR "Time to create ";
print STDERR "the doublet hash is ";
print STDERR "$totaltime seconds.\n\n";
close TEXT;
$begin = time();
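open(TEXT,"pathol5.txt")||die"cannot open pathol5.txt: $!";
open(STDOUT,">pathol5.out")||die"cannot open pathol5.out: $!";
$oldthing = ""; $state = 0;
# Read the input one line at a time. Testing the read in the loop
# condition avoids processing an undefined line at end of file.
while (defined($line = <TEXT>))
{
next if ($line eq "\n");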
print "Original - $line" . "Scrubbed - " ;
$line =~ s/\n$//;
#$line =~ s/\n/ /o;
my @linearray = split(/ +/,$line);
push (@linearray, "lastword");
foreach $thing (@linearray)
{
$originalthing = $thing;
$thing = lc($thing);
$thing =~ tr/a-z\'\-//cd;
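if ($oldthing eq "")
{
if ($thing eq "lastword")
{
# The previous token had no letters (a number, for example), so end
# the output line here rather than carry the sentinel to the next line.
print "\n";
$state = 0;
next;
}
$oldthing = $thing;
$originaloldthing = $originalthing;
next;
}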
$term = "$oldthing $thing";
if (exists($doublethash{$term}))
{
print "$originaloldthing ";
$oldthing = $thing;
$originaloldthing = $originalthing;
$state = 1;
next;
}
if ($state == 1)
{
if ($thing eq "lastword")
{
print $originaloldthing;
print "\n";
$oldthing = "";
$state = 0;
next;
}
print "$originaloldthing ";
$oldthing = $thing;
$originaloldthing = $originalthing;
$state = 0;
next;
}
if ($state == 0)
{
if ($thing eq "lastword")
{
print "\*\.\n";
$oldthing = "";
next;
}
$punctuation = substr($originaloldthing,-1,1);
if ($punctuation =~ /[a-zA-Z0-9]/)
{
$punctuation = "";
}
print "\*" . "$punctuation ";
$oldthing = $thing;
$originaloldthing = $originalthing;
next;
}
}
}
$end = time();
$totaltime = $end - $begin;
print STDERR "Time following ";
print STDERR "doublet hash creation";
print STDERR " is $totaltime seconds.\n";
exit;
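
The rule the script applies can be stated compactly: a word is kept if it forms an approved doublet with the word before it or the word after it; otherwise it is replaced by an asterisk, followed by its trailing punctuation if it has any. The following is a minimal sketch of that rule on a single line, not a replacement for the script above. The sub name scrub_line and the three-entry doublet list are hypothetical, and unlike the script, the sketch masks tokens with no letters rather than dropping them and does not force a period after a masked final word.

```perl
use strict;
use warnings;

# Tiny stand-in for the real doublet list (hypothetical entries).
my %doublets = map { $_ => 1 } ("carcinoma of", "of the", "the lung");

# Hypothetical helper: return the scrubbed form of one line of text.
sub scrub_line {
    my ($line, $approved) = @_;
    my @orig = split ' ', $line;
    # Lookup keys: lowercase, letters, apostrophes, and hyphens only.
    my @key = map { my $k = lc; $k =~ tr/a-z'\-//cd; $k } @orig;
    my @out;
    for my $i (0 .. $#orig) {
        my $with_prev = $i > 0
            && exists $approved->{"$key[$i-1] $key[$i]"};
        my $with_next = $i < $#orig
            && exists $approved->{"$key[$i] $key[$i+1]"};
        if ($with_prev || $with_next) {
            push @out, $orig[$i];    # original case and punctuation
        }
        else {
            # Mask the word but keep a trailing punctuation mark.
            my ($punct) = $orig[$i] =~ /([^a-zA-Z0-9])$/;
            push @out, "*" . (defined $punct ? $punct : "");
        }
    }
    return join " ", @out;
}

print scrub_line("Berman: Carcinoma of the lung.", \%doublets), "\n";
```

With this toy list, the author name is masked while its colon survives, and the medical phrase passes through with its capital letter and final period intact.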


- Jules Berman
