Thursday, January 31, 2008

More on "iform" words in UMLS

Here is the Perl script that extracted the "iform" words from UMLS:


#!/usr/local/bin/perl
open (TEXT, "MRCONSO") || die "cannot open MRCONSO";
$line = " ";
while ($line ne "")
{
$line = <TEXT>;
next if ($line !~ /ENG/);   #keep only lines from English-language sources
if ($line =~ /\b[a-z]+iform\b/i)   #find a word ending in "iform"
{
$term = lc($&);   #$& holds the matched word
$subhash{$term}++;
}
}
foreach $key (sort keys %subhash)
{
print "$key\n";
}
exit;


The MRCONSO file (named MRCON in earlier UMLS releases) is the large (greater than 800 Megabyte) UMLS Metathesaurus file that contains all of the Metathesaurus terms. It is available free from the U.S. National Library of Medicine, but you need to register and complete an online license agreement before they will release the Metathesaurus files to you.

The Perl script (above) can be easily modified for simple extraction projects. If you're interested in learning Perl to help you with biomedical projects, you might want to read my book, Methods in Medical Informatics: Fundamentals of Healthcare Programming in Perl, Python, and Ruby (Chapman & Hall/CRC Mathematical and Computational Biology).
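For example, here is a hedged sketch of one such modification (untested against your UMLS release; the "oid" suffix and the field position of the language code are illustrative). Instead of testing for "ENG" anywhere in the line, it checks the LAT (language) field of the pipe-delimited MRCONSO record, which is the second field in the standard column layout:


#!/usr/local/bin/perl
#Sketch: extract "oid" words from the English records of MRCONSO.
#Assumes the pipe-delimited layout with LAT as the second field;
#verify the column order against your UMLS release documentation.
open (TEXT, "MRCONSO") || die "cannot open MRCONSO";
while ($line = <TEXT>)
{
@fields = split(/\|/, $line);
next if ($fields[1] ne "ENG");   #keep English-language terms only
while ($line =~ /\b[a-z]+oid\b/ig)   #collect every match on the line
{
$subhash{lc($&)}++;
}
}
close TEXT;
foreach $key (sort keys %subhash)
{
print "$key\n";
}
exit;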

Here is the complete list of "iform" words:

acneiform
acniform
ansiform
apoplectiform
arciform
bacilliform
canaliform
cerebriform
chancriform
choreiform
chyliform
coliform
coralliform
cordiform
cribiform
cribriform
cruciform
cuciform
cuneiform
cupuliform
curariform
dendriform
dermiform
disciform
emboliform
epileptiform
equiform
falciform
fetiform
filariform
filiform
filliform
flagelliform
framboesiform
fundiform
fungiform
fusiform
gadiform
gelatiniform
gigantiform
gyriform
herpetiform
hydatidiform
ichthyosiform
intercuneiform
juxtarestiform
kaposiform
lentiform
licheniform
moniliform
morbilliform
multiform
myrtiform
naviculocuneiform
neuralgiform
nonhydatidiform
noviform
pampiniform
pectiniform
perciform
piriform
pisiform
pityriasiform
plexiform
prepiriform
prepyriform
proteiform
psoriasiform
punctiform
pyriform
rediform
reniform
restiform
retiform
retrolentiform
rhabditiform
rubelliform
sacciform
scarlatiniform
scarletiniform
schizophreniform
sclerodermiform
scorpaeniform
spongiform
storiform
subcuneiform
sublentiform
unciform
uniciform
uniform
valpiform
varicelliform
varioliform
vermiform
verruciform
vitelliform
zosteriform

In June, 2014, my book, entitled Rare Diseases and Orphan Drugs: Keys to Understanding and Treating the Common Diseases was published by Elsevier. The book builds the argument that our best chance of curing the common diseases will come from studying and curing the rare diseases.



I urge you to read more about my book. There's a generous preview of the book at the Google Books site. If you like the book, please request your librarian to purchase a copy of this book for your library or reading room.

- Jules J. Berman, Ph.D., M.D. tags: common disease, orphan disease, orphan drugs, rare disease, disease genetics, biomedical informatics, perl programming

Wednesday, January 30, 2008

Arcane "iform" words in UMLS

Anonymous commented, on my Jan 7 blog Possessive forms of eponymous neoplasms:

"Interesting post. As a pathologist, I must interject that I have never used the term "rubriform" and don't know of any pathologists in the US that use it either. Perhaps it is an old term from the literature? Many of the terms for skin diseases, used by both dermatologists and dermatopathologists, are famously baroque - I can imagine a "rubriform" in that arena somewhere..."

Anonymous is correct. Most pathologists would never use "rubriform."

To satisfy my own curiosity, I extracted all of the English "iform" words found in UMLS.

Here they are:

acneiform
ansiform
apoplectiform
bacilliform
canaliform
cerebriform
chancriform
choreiform
chyliform
coliform
coralliform
cribiform
cribriform
cruciform
cuneiform
disciform
emboliform
epileptiform
falciform
filariform
filiform
flagelliform
fundiform
fungiform
fusiform
gigantiform
herpetiform
hydatidiform
hydatiform
ichthyosiform
intercuneiform
juxtarestiform
lentiform
morbilliform
multiform
neuralgiform
nonhydatidiform
pampiniform
piriform
pisiform
plexiform
prepyriform
proteiform
psoriasiform
punctiform
pyriform
reniform
restiform
retiform
retrolentiform
rubelliform
sacciform
scarlatiniform
schizophreniform
spongiform
storiform
subcuneiform
sublentiform
unciform
uniform (unintended)
varicelliform
varioliform
vermiform
verruciform
vitelliform
zosteriform

The UMLS seems to be missing a few that I have seen:

morpheiform
moniliform

In June, 2014, my book, entitled Rare Diseases and Orphan Drugs: Keys to Understanding and Treating the Common Diseases was published by Elsevier. The book builds the argument that our best chance of curing the common diseases will come from studying and curing the rare diseases.



I urge you to read more about my book. There's a generous preview of the book at the Google Books site. If you like the book, please request your librarian to purchase a copy of this book for your library or reading room.

- Jules J. Berman, Ph.D., M.D. tags: common disease, orphan disease, orphan drugs, rare disease, disease genetics, nomenclature, terminology

Comments sent to the "Specified Life" blog

I'd like to apologize to some of the people who have left blog comments. I was unaware until now that I could set up an alert that would notify me when comments are posted. Consequently, people have posted excellent comments that went unnoticed until this past week when I happened to review all of my past blogs.

Particularly, my colleagues, Drs. Pete Patterson, Jim Harrison and Bob McDowell all left comments that corrected errors in my posts, and I belatedly thank them.

The Comments section is now open to everyone and is unmoderated.

Also, I added an email link to the blog. Now you can alert your friends to interesting or useful blogs on the Specified Life site. The email that is sent contains a web address link to the blog (and does not contain the contents of the blog). It's probably best to add a sentence describing the blog in the provided comment box, so your colleague will know what to expect when he/she navigates to the blog address.

- Jules Berman

Friday, January 18, 2008

Corrections to Perl scripts

I am very grateful to Dr. Robert McDowell for uncovering a presentation problem in all of the Perl scripts that I've posted in prior blogs.

When the blog software encounters a Perl get-file command (<FILENAME>), it misinterprets the Perl expression as an HTML tag and suppresses its visibility.

Consequently, all of the Perl scripts that called for a line of a file (i.e., most of my posted Perl files) had a missing filehandle. I've gone back through all of the old posts and have substituted the HTML entities for the bracket characters (&lt; and &gt;) where appropriate, and I think this should have fixed the problem.
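If you post Perl code to a blog yourself, a small script along these lines (a sketch; the file names are examples only) will escape the offending characters before you paste the code into the blog editor:


#!/usr/local/bin/perl
#Sketch: HTML-escape a Perl script so that expressions such as
#<TEXT> are not swallowed by blog software as HTML tags.
#The input and output file names are examples only.
open (IN, "script.pl") || die "cannot";
open (OUT, ">script.htm") || die "cannot";
while ($line = <IN>)
{
$line =~ s/&/&amp;/g;   #escape ampersands first
$line =~ s/</&lt;/g;
$line =~ s/>/&gt;/g;
print OUT $line;
}
close IN;
close OUT;
exit;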

I apologize for any inconvenience this may have caused.

- Jules Berman

De-identifying a public domain book with the doublet method

In the last few blogs, I've been discussing the doublet method medical records scrubber. The doublet method de-identifier will accept any text file. To demonstrate the versatility of the doublet method, and to serve as a source of comparison with other de-identifiers, I downloaded a public domain book from Project Gutenberg, and posted the de-identified output, of the entire book, at the following URL:

http://www.julesberman.info/aacom10.htm

Project Gutenberg is a remarkable resource that publishes plain-text versions of literary gems that have passed out of copyright. I used Anomalies and Curiosities of Medicine by George M. Gould and Walter Lytle Pyle. This book has lots of medical terminology and vaguely resembles the kind of text that might be included in a pathology report. Anyone can download the same text from:

http://www.gutenberg.org/etext/747

A public domain list of doublets, doublets.txt, used in the script, is available for download, but I cannot guarantee that the list is identifier-free or that it is the best list for your purposes. Feel free to modify the list, add to the list, or create your own list of identifier-free doublets. In the script, "aacom10.txt" is the Project Gutenberg file for Anomalies and Curiosities of Medicine.

An example output paragraph is shown. As expected with the doublet method, there are many blocked words. This is a limitation of the doublet method. If you use the standard list of doublets on any random book, you're bound to block some innocent doublets that weren't included in the "approved" list. The only way to get around this limitation is to try to add safe doublets (from the text) to the "approved" list.

In this important *, *, * * some historical *, describes a long series of experiments performed on * in order to * the passage of *, *, *, *, *, *, * * the placenta. The placenta shows a real affinity for * substances; in it * copper and mercury, but *, and it is therefore * it that the * * *; in addition to its *, intestinal, and *, * * glycogen and acts as an * *, and so resembles in its action the liver; * * of the fetus * only a potential *. * up of * in the placenta is not so general as * of them in the liver of the mother. It may be * the placenta does not form a barrier to the passage of * the circulation of the fetus; this would seem to * * *, which was always found in the * never in the fetal organs. In * * lead and * accumulation of the * in the fetal tissues is * in the maternal, perhaps from differences in * * or from greater diffusion. * it is * * barrier to the passage of *, * * * * degree of obstruction: it allows copper and * * *, * with greater difficulty. The * toxic substances in the fetus does not follow the same * * the adult. They * more widely in the fetus. In the * liver is the chief * *. *, which in * * to accumulate in the liver, is in the fetus * in the skin; copper accumulates in the fetal liver, * system, and sometimes in the skin; * which is * in the maternal liver, but also in the skin, has * in the skin, liver, * centers, and elsewhere * *. The frequent presence of * in the fetal * its physiologic importance. It has probably not * * influence on its *. On the * in the placenta and nerve * * * * abortion and the birth of dead *) Copper and lead did not cause *, * * so in two out of six *. Arsenic is a * agent in the *, * * * * *. An important * is that * * is frequently and seriously affected in syphilis, * * the special * for the accumulation of *. * * * * * action in this disease? The * of lead in the central nervous system of the * the frequency and serious character of * lesions. The presence of * in the * * * an explanation of the therapeutic results of * of this substance in skin *.

The strength of the doublet method is speed (the 2.4 Megabyte book was de-identified in 3 seconds, much faster than other de-identifiers described in the literature). Also, the doublet method is virtually perfect. I have never encountered a missed identifier in text scrubbed by the doublet method. If you find any identifiers in the de-identified book, please let me know. Finally, the doublet method is simple. The Perl script that I used to scrub the book is shown below, in its entirety.

As with all my distributed scripts, the following disclaimer applies:

The perl script for deidentifying text using the doublet method is provided by its creator, Jules J. Berman, "as is", without warranty of any kind, expressed or implied, including but not limited to the warranties of merchantability, fitness for a particular purpose and noninfringement. in no event shall the author or copyright holder be liable for any claim, damages or other liability, whether in an action of contract, tort or otherwise, arising from, out of or in connection with the software or the use or other dealings in the software.


#!/usr/local/bin/perl
$begin = time();
open(TEXT,"doublets.txt")||die"cannot";
$line = " ";
while ($line ne "")
{
$line = $getline = <TEXT>;
$getline =~ s/\n//;
$doublethash{$getline}= "";   #each approved doublet becomes a hash key
}
$end = time();
$totaltime = $end - $begin;
print STDERR "Time to create ";
print STDERR "the doublet hash is ";
print STDERR "$totaltime seconds.\n\n";
close TEXT;
$begin = time();
$/ = "\n\n";   #set the record separator to read paragraph by paragraph
open(TEXT,"aacom10.txt")||die"cannot";
open(STDOUT,">aacom10.out")||die"cannot";
$line = " "; $oldthing = ""; $state = 0;
while ($line ne "")
{
$line = <TEXT>;
next if ($line eq "\n");
#print "Original - $line" . "Scrubbed - " ;
$line =~ s/\n$//;
$line =~ s/\n/ /o;
my @linearray = split(/ +/,$line);
push (@linearray, "lastword");   #sentinel marking the end of the paragraph
foreach $thing (@linearray)
{
$originalthing = $thing;
$thing = lc($thing);
$thing =~ tr/a-z\'\-//cd;   #keep only lowercase letters, apostrophes, and hyphens
if ($oldthing eq "")
{
$oldthing = $thing;
$originaloldthing = $originalthing;
next;
}
$term = "$oldthing $thing";
if (exists($doublethash{$term}))   #the word pair is an approved doublet
{
print "$originaloldthing ";
$oldthing = $thing;
$originaloldthing = $originalthing;
$state = 1;
next;
}
if ($state == 1)
{
if ($thing eq "lastword")
{
print $originaloldthing;
print "\n\n";
$oldthing = "";
$state = 0;
next;
}
print "$originaloldthing ";
$oldthing = $thing;
$originaloldthing = $originalthing;
$state = 0;
next;
}
if ($state == 0)
{
if ($thing eq "lastword")
{
print "\*\.\n\n";
$oldthing = "";
next;
}
$punctuation = substr($originaloldthing,-1,1);   #preserve trailing punctuation
if ($punctuation =~ /[a-zA-Z0-9]/)
{
$punctuation = "";
}
print "\*" . "$punctuation ";
$oldthing = $thing;
$originaloldthing = $originalthing;
next;
}
}
}
$end = time();
$totaltime = $end - $begin;
print STDERR "Time following ";
print STDERR "doublet hash creation";
print STDERR " is $totaltime seconds.";
exit;


- Jules Berman
My book, Principles of Big Data: Preparing, Sharing, and Analyzing Complex Information was published in 2013 by Morgan Kaufmann.



I urge you to explore my book. Google books has prepared a generous preview of the book contents.

tags: big data, metadata, data preparation, data analytics, data repurposing, datamining, data mining, de-identification, doublet method, electronic medical record, medical scrubber, de-identification, doublet method, electronic medical record, medical scrubber, privacy, confidentiality

Thursday, January 17, 2008

Fast deidentifier that preserves punctuation

On Tuesday, Jan 15, 2008, I provided a very simple, fast, and almost perfect medical record de-identifier Perl script. The script uses a public domain list of about 200,000 word doublets.

A public domain file shows the output of this de-identifier run against an input of about 15,000 PubMed medical citations. PubMed citations are an excellent way to test de-identifiers because they are not copyrighted, they contain lots of medical vocabulary, and they are full of identifiers (the names of the authors).

The provided output file does not preserve the punctuation in the original text.

It is easy to modify the Perl script to preserve case (lowercase, uppercase) and punctuation from the original text, and the output of the modified script is also available.

As with all my distributed scripts, the following disclaimer applies:

The perl script for deidentifying text using the doublet method is provided by its creator, Jules J. Berman, "as is", without warranty of any kind, expressed or implied, including but not limited to the warranties of merchantability, fitness for a particular purpose and noninfringement. in no event shall the author or copyright holder be liable for any claim, damages or other liability, whether in an action of contract, tort or otherwise, arising from, out of or in connection with the software or the use or other dealings in the software.

The revised Perl script is shown here:


#!/usr/local/bin/perl
$begin = time();
open(TEXT,"doublets.txt")||die"cannot";
$line = " ";
while ($line ne "")
{
$line = $getline = <TEXT>;
$getline =~ s/\n//;
$doublethash{$getline}= "";
}
$end = time();
$totaltime = $end - $begin;
print STDERR "Time to create ";
print STDERR "the doublet hash is ";
print STDERR "$totaltime seconds.\n\n";
close TEXT;
$begin = time();
open(TEXT,"pathol5.txt")||die"cannot";
open(STDOUT,">pathol5.out")||die"cannot";
$line = " "; $oldthing = ""; $state = 0;
while ($line ne "")
{
$line = <TEXT>;
next if ($line eq "\n");
print "Original - $line" . "Scrubbed - " ;
$line =~ s/\n$//;
#$line =~ s/\n/ /o;
my @linearray = split(/ +/,$line);
push (@linearray, "lastword");
foreach $thing (@linearray)
{
$originalthing = $thing;
$thing = lc($thing);
$thing =~ tr/a-z\'\-//cd;
if ($oldthing eq "")
{
$oldthing = $thing;
$originaloldthing = $originalthing;
next;
}
$term = "$oldthing $thing";
if (exists($doublethash{$term}))
{
print "$originaloldthing ";
$oldthing = $thing;
$originaloldthing = $originalthing;
$state = 1;
next;
}
if ($state == 1)
{
if ($thing eq "lastword")
{
print $originaloldthing;
print "\n";
$oldthing = "";
$state = 0;
next;
}
print "$originaloldthing ";
$oldthing = $thing;
$originaloldthing = $originalthing;
$state = 0;
next;
}
if ($state == 0)
{
if ($thing eq "lastword")
{
print "\*\.\n";
$oldthing = "";
next;
}
$punctuation = substr($originaloldthing,-1,1);
if ($punctuation =~ /[a-zA-Z0-9]/)
{
$punctuation = "";
}
print "\*" . "$punctuation ";
$oldthing = $thing;
$originaloldthing = $originalthing;
next;
}
}
}
$end = time();
$totaltime = $end - $begin;
print STDERR "Time following ";
print STDERR "doublet hash creation";
print STDERR " is $totaltime seconds.";
exit;


- Jules Berman
My book, Principles of Big Data: Preparing, Sharing, and Analyzing Complex Information was published in 2013 by Morgan Kaufmann.



I urge you to explore my book. Google books has prepared a generous preview of the book contents.

tags: big data, metadata, data preparation, data analytics, data repurposing, datamining, data mining, deidentified, deidentifier, hipaa, medical de-identifier, medical scrubber, scrubbed text

Wednesday, January 16, 2008

Minimum Necessary Provision (HIPAA) and deidentifier software

Most deidentification software (for medical records) does not actually deidentify records. The software simply reduces the number of identifiers in the records. This goes far toward explaining the complete absence of de-identified public domain medical record datasets, prepared by automatic deidentifiers and made available on the Web.

A large corpus of so-called de-identified records is certain to contain some HIPAA identifiers. In the U.S., the language in HIPAA would indicate that if you're reasonably certain that the data cannot be used to identify a protected individual, the data is exempted from HIPAA restrictions. But if you start off knowing that the data contains some HIPAA identifiers, you lose the reasonable expectation that no patient can be re-identified from the data!

What, then, is the value of automatic de-identifiers that do not fully de-identify? In the U.S., HIPAA permits two methods by which an IRB can allow data that is not fully deidentified to be used for a variety of purposes. First is the Waiver, which permits the IRB to allow the use of the data if certain conditions are met. Second is the Limited Use agreement, which permits a specified partner to receive data that is not fully de-identified, under certain conditions. In the U.S., the Common Rule, which applies to human subject research, permits IRB Waivers under a very similar set of conditions.

Even when the corpus of records is shared under a Waiver or a Limited Use agreement, the data must conform to the Minimum Necessary provision in HIPAA. When using identified information for permitted purposes, HIPAA requires that only the minimal amount of information needed for the purpose is disclosed (see the HIPAA excerpt, below). This would imply that information unrelated to research goals but included in medical reports, must be removed prior to transferring the reports to external covered entities. The Minimum Necessary provision applies to information other than identifying information.

-Section 164.514(d)--Minimum Necessary "covered entities must make reasonable efforts to use or disclose or to request from another covered entity, only the minimum amount of protected health information required to achieve the purpose of a particular use or disclosure."

Most de-identifier software cannot, in any way, help a data holder comply with the Minimum Necessary provision. However, the Concept-Match method, which blocks all text except for phrases that match terms contained in a medical nomenclature (such as the UMLS) or high frequency words (when, if, can, are, the, etc.), will block all text except for the "Minimum Necessary". To a somewhat lesser extent, the "doublet method" will do the same.
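To make the idea concrete, here is a rough word-level sketch of this kind of blocking (an illustration only, not the published Concept-Match implementation; "nomenclature.txt" and the short stop word list are placeholders). Every word that is neither a stop word nor found in the nomenclature is replaced with an asterisk:


#!/usr/local/bin/perl
#Sketch only: word-level simplification of the Concept-Match idea.
#"nomenclature.txt" is a placeholder for a file of approved medical
#terms, one term per line.
foreach $stopword (qw(the a an of in on and or if when can are is was))
{
$stophash{$stopword} = 1;   #tiny illustrative stop word list
}
open (NOMEN, "nomenclature.txt") || die "cannot";
while ($line = <NOMEN>)
{
$line = lc($line);
$line =~ s/\n//;
foreach $word (split(/ +/, $line))
{
$nomenhash{$word} = 1;   #approve every word of every nomenclature term
}
}
close NOMEN;
while ($line = <STDIN>)
{
foreach $word (split(/ +/, $line))
{
$lcword = lc($word);
$lcword =~ s/[^a-z\'\-]//g;
if (exists($stophash{$lcword}) || exists($nomenhash{$lcword}))
{
print "$word ";   #preserve approved words
}
else
{
print "\* ";   #block everything else
}
}
print "\n";
}
exit;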

The strengths and limitations of the various types of de-identifiers now available as open source software are discussed in Biomedical Informatics and in Ruby Programming for Medicine and Biology.

-Jules Berman
My book, Principles of Big Data: Preparing, Sharing, and Analyzing Complex Information was published in 2013 by Morgan Kaufmann.



I urge you to explore my book. Google books has prepared a generous preview of the book contents.

tags: big data, metadata, data preparation, data analytics, data repurposing, datamining, data mining, deidentification, hipaa, minimum necessary

Tuesday, January 15, 2008

Easy method for building public domain medical text corpus

In the last few blogs, I used a public domain corpus of medical citations to demonstrate an automatic scrubber (medical text deidentifier).

It is remarkably easy to create a large public domain text corpus for almost any medical specialty. All you need to do is to download a PubMed search by sending the search results to "file." If the search collects 50,000 citations, all of the citations will be sent to a file (that you name) on your own hard drive.

The Pubmed Search site is:
http://www.ncbi.nlm.nih.gov/pubmed/?
Titles and names are, according to the U.S. Copyright Office, excluded from copyright. PubMed citations, which consist of titles, names, and some annotation data (volume, pages, date), can be used freely. The same cannot be said for abstracts, which, as far as I can tell, can be copyrighted. The default "Summary" display produces a list of citations (without abstracts) when downloaded as a file.
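The same kind of corpus can also be assembled programmatically through the NCBI E-utilities interface. Here is a hedged sketch using LWP::Simple; the search term and retmax value are examples only, and you should check the current E-utilities documentation and usage policies before running bulk downloads:


#!/usr/local/bin/perl
#Sketch: fetch PubMed citations through NCBI E-utilities.
#The search term and the retmax value are examples only.
use LWP::Simple;
$base = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils";
$term = "pathology";
$search = get("$base/esearch.fcgi?db=pubmed&term=$term&retmax=100");
die "esearch failed" unless (defined($search));
@ids = ($search =~ /<Id>(\d+)<\/Id>/g);   #pull the PMIDs from the XML
$idlist = join(",", @ids);
$citations = get("$base/efetch.fcgi?db=pubmed&id=$idlist&rettype=abstract&retmode=text");
open (OUT, ">citations.txt") || die "cannot";
print OUT $citations;
close OUT;
exit;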

-Jules Berman


Science is not a collection of facts. Science is what facts teach us; what we can learn about our universe, and ourselves, by deductive thinking. From observations of the night sky, made without the aid of telescopes, we can deduce that the universe is expanding, that the universe is not infinitely old, and why black holes exist. Without resorting to experimentation or mathematical analysis, we can deduce that gravity is a curvature in space-time, that the particles that compose light have no mass, that there is a theoretical limit to the number of different elements in the universe, and that the earth is billions of years old. Likewise, simple observations on animals tell us much about the migration of continents, the evolutionary relationships among classes of animals, why the nuclei of cells contain our genetic material, why certain animals are long-lived, why the gestation period of humans is 9 months, and why some diseases are rare and other diseases are common. In “Armchair Science”, the reader is confronted with 129 scientific mysteries, in cosmology, particle physics, chemistry, biology, and medicine. Beginning with simple observations, step-by-step analyses guide the reader toward solutions that are sometimes startling, and always entertaining. “Armchair Science” is written for general readers who are curious about science, and who want to sharpen their deductive skills.

Perl implementation of doublet deidentifier

Here is the Perl code for implementing the doublet deidentifier (medical record scrubber).

It operates on a collection of over 15000 PubMed Citations (author line and title line), and uses a publicly available external list of "safe" doublets. A plain-text file of doublets is available.

The entire output of the script is available for review.

As with all my distributed scripts, the following disclaimer applies:

The perl script for deidentifying text using the doublet method is provided by its creator, Jules J. Berman, "as is", without warranty of any kind, expressed or implied, including but not limited to the warranties of merchantability, fitness for a particular purpose and noninfringement. in no event shall the author or copyright holder be liable for any claim, damages or other liability, whether in an action of contract, tort or otherwise, arising from, out of or in connection with the software or the use or other dealings in the software.


#!/usr/local/bin/perl
$begin = time();
open(TEXT,"doublets.txt")||die"cannot";
$line = " ";
while ($line ne "")
{
$line = $getline = <TEXT>;
$getline =~ s/\n//;
$doublethash{$getline}= "";
}
$end = time();
$totaltime = $end - $begin;
print STDERR "Time following to create ";
print STDERR "the doublet hash is ";
print STDERR "$totaltime seconds.\n\n";
close TEXT;
$begin = time();
open(TEXT,"pathol5.txt")||die"cannot";
open(STDOUT,">pathol5.out")||die"cannot";
$line = " "; $oldthing = ""; $state = 0;
while ($line ne "")
{
$line = <TEXT>;
next if ($line eq "\n");
print "Original - $line" . "Scrubbed - " ;
$line =~ s/[\,\.\n]//g;
$line = lc($line);
my @linearray = split(/ /,$line);
push (@linearray, "lastword");
foreach $thing (@linearray)
{
if ($oldthing eq "")
{
$oldthing = $thing;
next;
}
$term = "$oldthing $thing";
if (exists($doublethash{$term}))
{
print "$oldthing ";
$oldthing = $thing;
$state = 1;
next;
}
if ($state == 1)
{
if ($thing eq "lastword")
{
print $oldthing;
print "\.\n";
$oldthing = "";
$state = 0;
next;
}
print "$oldthing ";
$oldthing = $thing;
$state = 0;
next;
}
if ($state == 0)
{
if ($thing eq "lastword")
{
print "\*\.\n";
$oldthing = "";
next;
}
print "\* ";
$oldthing = $thing;
next;
}
}
}
$end = time();
$totaltime = $end - $begin;
print STDERR "Time following ";
print STDERR "doublet hash creation";
print STDERR " is $totaltime seconds.";
exit;


-Jules Berman
My book, Principles of Big Data: Preparing, Sharing, and Analyzing Complex Information was published in 2013 by Morgan Kaufmann.



I urge you to explore my book. Google books has prepared a generous preview of the book contents.

Jules J. Berman, Ph.D., M.D.
tags: big data, metadata, data preparation, data analytics, data repurposing, datamining, data mining, de-identification, deidentification, doublet method, medical scrubber, Perl script

Monday, January 14, 2008

Medical record de-identifier (using the doublet method)

Earlier today, I wrote a blog describing an identifier-free doublet list. This blog describes a scrubber use-case for the doublet list.

The de-identification of medical text is a good use of the doublet list. To share medical records (usually for the purposes of research), it is often important to remove all of the identifiers (that could link a patient to the record).

A variety of medical text scrubbers are now available for this purpose. Most require the users to develop identifier lists for their site (lists of patient names, doctor names, etc.), run very slowly (typically, about one record per second), and do not remove anywhere close to all of the identifiers.

Not so for the doublet method. A doublet scrubber parses through any text, matching doublets from the text against an external identifier-free doublet list, preserving all matching doublets from the text, and blocking all non-matching words with an asterisk. If your list of doublets contains no identifiers, the scrubbed output should be perfectly de-identified. Though perfection can never be guaranteed, I have never encountered any "missed" identifiers in a text that was parsed under these conditions. A public domain list of doublets is available, but I cannot guarantee that the list is identifier-free or that it is the best list for your purposes. Feel free to modify the list, add to the list, or create your own list of identifier-free doublets.

The doublet method is described in Ruby Programming for Medicine and Biology.

For each citation, the list of authors is put on a line, and is immediately followed by its scrubbed version on the next line. Then the title of the article is put on the next line, followed by the scrubbed version of the title. This pattern is repeated for all 15,000+ citations.

The doublet scrubber is small (just a few dozen lines of code) and fast. It took approximately 2 seconds to parse the 15,000 citations using a Perl script with access to a list of about 200,000 identifier-free doublets. I used my home computer (2.8 GHz, 512 Megabyte RAM). This is a scrubbing rate of about 1 Megabyte per second. At this speed, a 1 Gigabyte file could be parsed in about 15 minutes, and a 1 Terabyte file in under two weeks. Large hospitals produce about 1 Terabyte of data each week, so this scrubber can, for now, roughly keep pace with the vast load of data produced by many hospitals (using a modest desktop computer).

The only limitation that I have found with the doublet scrubber is that it scrubs too much, blocking all doublets not found in the external doublet list. You can judge for yourself by reviewing the provided output file, which can be used to assess the effectiveness of the doublet method of text scrubbing.

-Jules Berman tags: common rule, data scrubbing, de-identification, deidentification, hipaa, medical records

Parsable Doublets List now available in public domain

Word doublets are two-word phrases that actually appear in text (i.e., they are not randomly chosen two-word sequences).

Doublets can be used in a variety of informatics projects: indexing, data scrubbing, nomenclature curation, etc. Over the next few days, I will provide examples of doublet-based informatics projects.

A list of over 200,000 word doublets is available for download.

The list was generated from a large narrative pathology text. Thus, the doublets included here would be particularly suitable for informatics projects involving surgical pathology reports, autopsy reports, pathology papers and books, and so on.

The Perl script that generated the list of doublets, by parsing through a text file ("pathold.txt"), is shown here:


#!/usr/local/bin/perl
open(TEXT,"pathold.txt")||die"cannot";
open(OUT,">doublets.txt")||die"cannot";
undef($/);   #undefine the record separator to slurp the whole file at once
$var = <TEXT>;
$var =~ s/\n/ /g;
$var =~ s/\'s//g;
$var =~ tr/a-zA-Z\'\- //cd;
@words = split(/ +/, $var);
foreach $thing (@words)
{
$doublet = "$oldthing $thing";
if ($doublet =~ /^[a-z]+ [a-z]+$/)   #keep only all-lowercase word pairs
{
$doublethash{$doublet}="";
}
$oldthing = $thing;
}
close TEXT;
@wordarray = sort(keys(%doublethash));
print OUT join("\n",@wordarray);
close OUT;
exit;


You can generate your own list by substituting any text file you like for "pathold.txt". Keep in mind that the Perl script slurps the entire text file into a string variable, so the script won't work if you use a file that exceeds the memory of the computer. For most computers (with more than 256 Megabytes of RAM) this will not be a problem. On my computer (about 2.8 GHz and 512 Megabytes of RAM) the script takes about 5 seconds to parse a 9 Megabyte text file.
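For files too large to slurp, a line-by-line variant along these lines (a sketch, preserving the same all-lowercase filter as the script above) keeps memory usage down to the size of the doublet hash itself:


#!/usr/local/bin/perl
#Sketch: build the doublet list without slurping the whole file.
#Only the doublet hash is held in memory.
open (TEXT, "pathold.txt") || die "cannot";
open (OUT, ">doublets.txt") || die "cannot";
$oldthing = "";
while ($line = <TEXT>)
{
$line =~ s/\'s//g;
$line =~ tr/a-zA-Z\'\- //cd;   #this also deletes the newline
foreach $thing (split(/ +/, $line))
{
$doublet = "$oldthing $thing";
if ($doublet =~ /^[a-z]+ [a-z]+$/)   #keep only all-lowercase word pairs
{
$doublethash{$doublet} = "";
}
$oldthing = $thing;   #carry the last word across line breaks
}
}
close TEXT;
@wordarray = sort(keys(%doublethash));
print OUT join("\n", @wordarray);
close OUT;
exit;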

Since the doublet list consists of a non-narrative collection of words, it cannot be copyrighted (i.e., it is distributed as a public domain file).

-Jules Berman
My book, Principles of Big Data: Preparing, Sharing, and Analyzing Complex Information was published in 2013 by Morgan Kaufmann.



I urge you to explore my book. Google books has prepared a generous preview of the book contents.

tags: big data, metadata, data preparation, data analytics, data repurposing, datamining, data mining, biomedical informatics, curation, data scrubbing, deidentification, medical nomenclature, Perl script, public domain, doublets list

Sunday, January 13, 2008

Complex Hospital Information Systems (HISs)

This is one of a series of blogs on hospital information technology, emphasizing limitations and pitfalls.

The following is a short excerpt from Biomedical Informatics.

Beginning of excerpt:

Much can be learned from documented technology disasters. A 2003 article in the British Medical Journal described a project to install a computerized integrated hospital information system in Limpopo (Northern) Province of South Africa (1). This poor province with 42 hospitals had invested heavily to acquire the system. This fascinating article describes what went wrong and provides a list of factors that led to the failure of the system. These included a failure to take into account the social and cultural milieu in which the system would be used, an underestimation of the complexity of the undertaking, and insufficient appreciation of the length of training required by the hospital staff.

Failed system implementations also occur in the U.S. The Veterans Administration spent hundreds of millions of dollars on a financial tracking system. The software implementation failed during trials at the Bay Pines VA in Tampa Bay, Florida. Plans for extending the system to other VA hospitals became the subject of Congressional hearings (2).

End of excerpt

When an IT disaster occurs at a major medical center, there is a tendency to hush up the news. I have been told, in confidence, of other expensive systems that failed to implement or that implemented in a way that most users considered unsatisfactory. My perception is that HIS disasters are actually common, but I have no way of proving it.

There are many blogs in which health IT problems are openly discussed. Two of these are:

http://labsoftnews.typepad.com/

http://www.histalk2.com/

References:

1. Littlejohns P, Wyatt JC, Garvican L. Evaluating computerised health information systems: hard lessons still to be learnt. British Medical Journal 326:860-863, April 19, 2003.
2. Untangling the VA computer crash: how Bay Pines hospital went from guinea pig to paralyzed victim because of pressure to implement a new system. St. Petersburg Times, Tampa Bay, March 28, 2004. (Comment: the Veterans Administration spent nearly $500 million on a failed computer system.)


-Jules Berman
My book, Principles of Big Data: Preparing, Sharing, and Analyzing Complex Information was published in 2013 by Morgan Kaufmann.



I urge you to explore my book. Google books has prepared a generous preview of the book contents. If you like the book, please request your librarian to purchase a copy of this book for your library or reading room.

tags: big data, metadata, data preparation, data analytics, data repurposing, datamining, data mining

Saturday, January 12, 2008

Computerized physician order entry (CPOE)

This is one of a series of blogs on hospital information technology, emphasizing its limitations and pitfalls.

The following is a short excerpt from Biomedical Informatics.

One of the most challenging features of many HISs is computerized physician order entry (CPOE). The intent of CPOE is to eliminate the wasteful hand-written (often illegible) doctor's orders that may need to be transcribed by nurses, pharmacists, and laboratory personnel before finally being entered into the HIS. Having the physicians directly enter their orders into the HIS has been a long-awaited dream for many hospital administrators. In a fascinating report, patient mortality was shown to increase after implementation of CPOE.

[Han YY, Carcillo JA, Venkataraman ST, Clark RS, Watson RS, Nguyen TC, Bayir H, Orr RA. Unexpected increased mortality after implementation of a commercially sold computerized physician order entry system. Pediatrics 116:1506-1512, 2005. text of article]

In this study, having CPOE was a strong, independent predictor of patient death. Somehow, a computerized service intended to enhance patient care had put patients at increased risk.

Without commenting on this particular study, it may be useful to review some factors that transform CPOE, and other well-intended medical informatics efforts, into destructive technologies.

Reasons why hospital informatics projects, such as CPOE, may fail.

-Tasks that were traditionally accomplished through interpersonal communication may be replaced by solitary entry sessions with HIS computer terminals. Opportunities to share helpful explanations and patient status updates may be lost.

-Computer entry tasks may be tedious, time-consuming, and repetitive. Harried staff, under these circumstances, may do an incomplete or sloppy job.

-Computer orders, once entered, may have no mechanism for correcting entry errors, leading to miscommunications.

-The asynchronous nature of multi-user entries into the HIS may cause havoc in a system that depends on coordinated workflow. For instance, a prescription may not be filled by the pharmacy until an order entered by a clerk-typist is released by a physician. If there is no system to ensure that each entry occurs in a timely and coordinated manner, workflow is halted.

Beyond informatics issues lie all-important social issues. High-tech medical solutions seldom achieve a desired effect for low-tech medical staff. Introducing complex informatics services, such as CPOE, requires staff training. There needs to be effective communication between the clinical staff and the hospital IT staff and between the hospital IT staff and the HIS vendor staff. Everyone involved must cooperate until the implemented system is working smoothly.

-Jules Berman

In June, 2014, my book, entitled Rare Diseases and Orphan Drugs: Keys to Understanding and Treating the Common Diseases was published by Elsevier. The book builds the argument that our best chance of curing the common diseases will come from studying and curing the rare diseases.



I urge you to read more about my book. There's a generous preview of the book at the Google Books site. If you like the book, please request your librarian to purchase a copy of this book for your library or reading room.

tags: common disease, orphan disease, orphan drugs, rare disease, subsets of disease, disease genetics, genetics of complex disease, genetics of common diseases, cryptic disease

Friday, January 11, 2008

Does the EHR really help?

Yesterday, I posted a blog that discussed a white paper indicating that the most important causes of hospital errors (leading to death) were: Failure to Rescue, Decubitus Ulcer, and Post-operative Sepsis. Most of the causes of medical error described in the article seemed to have very little to do with hospital information systems or the electronic health record (EHR).

Another interesting article, recently published, is:

Electronic Health Record Use and the Quality of Ambulatory Care in the United States. Jeffrey A. Linder, MD, MPH; Jun Ma, MD, RD, PhD; David W. Bates, MD, MSc; Blackford Middleton, MD, MPH, MSc; Randall S. Stafford, MD, PhD. Arch Intern Med. 2007;167:1400-1405.

In this article, the authors found that "Electronic health records were used in 18% (95% confidence interval [CI], 15%-22%) of the estimated 1.8 billion ambulatory visits (95% CI, 1.7-2.0 billion) in the United States in 2003 and 2004. For 14 of the 17 quality indicators [that the authors examined], there was no significant difference in performance between visits with vs without EHR use."

In the rush to adopt the EHRs in American medical care, it might be worthwhile to think about those conditions for which the EHR may not be a particular benefit, and those conditions for which EHRs may be a disruptive technology carrying extra risk, and those conditions for which rapid adoption of EHRs is greatly beneficial (if not crucial).

In the next few days, I'll try to write additional blogs on this subject.

- Jules Berman
My book, Principles of Big Data: Preparing, Sharing, and Analyzing Complex Information was published in 2013 by Morgan Kaufmann.



I urge you to explore my book. Google books has prepared a generous preview of the book contents.

tags: big data, metadata, data preparation, data analytics, data repurposing, datamining, data mining, disruptive technology, ehr, emr, laboratory information system

Thursday, January 10, 2008

Failure to rescue

In a fascinating white paper published in 2004, Health Grades, Inc. reached the following conclusions from a large review of MedPar data (Medicare records):

"Approximately 1.14 million total patient safety incidents [PSIs] occurred among the 37 million hospitalizations in the Medicare population from 2000 through 2002."

"The PSIs with the highest incident rates per 1,000 hospitalizations at risk were Failure to Rescue, Decubitus Ulcer, and Post-operative Sepsis. These three patient safety incidents accounted for almost 60% of all patient safety incidents among Medicare patients hospitalized from 2000 through 2002."

"The 16 PSIs studied accounted for $8.54 billion in excess inpatient cost to the Medicare system over 3 years, or roughly $2.85 billion annually. Decubitus Ulcer ($2.57 billion), Post-operative Pulmonary Embolism or Deep Vein Thrombosis ($1.40 billion), and Selected Infections due to Medical Care ($1.71 billion) were the most costly and accounted for 66% of all excess attributable costs from 2000 through 2002."

Failure to rescue usually occurs in a setting where a patient starts to develop signs and symptoms that are unevaluated clinically by the medical staff. Maybe the patient has a small rise in temperature, or maybe the patient has a sudden chest pain that is unaccompanied by ECG changes, or unusual leg pain, or maybe the patient has a little GI upset, or seems a bit agitated. It takes a great deal of judgment to react wisely when patients develop unexpected changes in physical or mental status.

Still, small problems can easily lead to big problems in a medical setting, and big problems can lead to death. Often, especially after the small problems have gotten out of hand, the response time by the clinical staff is crucial.

There was an excellent report on automatic defibrillators in the New York Times, Jan 3, 2008, by Denise Grady, entitled, "Hospitals Slow in Heart Cases, Research Finds." The author described a Failure to Rescue scenario that occurs commonly in hospitals. A patient suffers a heart attack and a consequent arrhythmia that could be reversed with defibrillation if received in under two minutes. Many hospitals cannot respond with defibrillation within the two minute window. The reasons are systemic and may include policies that forbid floor nurses to defibrillate.

In contrast to hospitals, automatic defibrillators that are kept at ball parks, health clubs, and department stores permit laypersons to defibrillate, because they come with automatic sensors that determine if the patient has a heart rhythm that can be rescued with the defibrillator. The upshot of the NY Times article was that your chance of receiving life-saving defibrillation may be higher at a ballgame (where people witness your event and a defibrillator is quickly available) than in a hospital setting.

- Jules Berman
In June, 2014, my book, entitled Rare Diseases and Orphan Drugs: Keys to Understanding and Treating the Common Diseases was published by Elsevier. The book builds the argument that our best chance of curing the common diseases will come from studying and curing the rare diseases.



I urge you to read more about my book. There's a generous preview of the book at the Google Books site. If you like the book, please request your librarian to purchase a copy of this book for your library or reading room.

- Jules J. Berman, Ph.D., M.D. tags: common disease, orphan disease, orphan drugs, genetics of disease, disease genetics, rules of disease biology, rare disease, pathology, critical period, failure to save, medical errors

Wednesday, January 9, 2008

NIH-funded investigators must make their publications open access

As of Dec. 27, 2007, investigators funded by NIH must submit a copy of their accepted research articles to PubMed Central, for open access publication.

Excerpt from publication policy (inside the omnibus spending bill):

"The Director of the National Institutes of Health shall require that all investigators funded by the NIH submit or have submitted for them to the National Library of Medicine's PubMed Central an electronic version of their final, peer-reviewed manuscripts upon acceptance for publication to be made publicly available no later than 12 months after the official date of publication: Provided, That the NIH shall implement the public access policy in a manner consistent with copyright law."

In May 2005, a new open access policy at NIH asked NIH-funded investigators to submit their publications for open access publication, but the earlier policy was voluntary.

How did scientists respond? In Feb. 2006, the NIH released data showing that grantee compliance with NIH's open access policy was below 4%.

Now, cooperation is a legal requirement for U.S. funded investigators. Other countries have preceded the U.S. with similar policies. It seems only fair that taxpayers get to see what they've paid for.

Source:
BiomedCentral article


-Jules Berman tags: funding, investigators, nih, open access, public access, science

Tuesday, January 8, 2008

Rhabdoid tumors, a sui generis class in the Neoplasm Classification

Rhabdoid tumors occur in just a few dozen children each year in the U.S. These aggressive tumors arise from brain (class Neuroectoderm in the Neoplasm Classification) and from kidney (class Mesoderm in the Neoplasm Classification) and contain morphologically distinctive cells (so-called rhabdoid cells, named for their superficial similarity to muscle cells).

Not long ago, the rhabdoid tumors that arose in the kidney were thought to be different from the rhabdoid tumors that arose from the brain. The common rhabdoid cell was considered to be a peculiar morphologic variant that just happened to occur in both types of tumors. The kidney tumor was called MRT (malignant rhabdoid tumor). MRT was considered, by many pathologists, to be a variant form of nephroblastoma. The rhabdoid brain tumor was called AT/RT (Atypical teratoid rhabdoid tumor), and some pathologists classified AT/RT among the PNETs.

The perceived distinction between MRT and AT/RT started to disappear when it was found that about 10% of patients with MRT also developed AT/RT or so-called primitive neuroectodermal neoplasm of brain. Recently, a characteristic genetic abnormality has been found in rhabdoid tumors of CNS or renal origin: bi-allelic loss of INI1 gene expression. Immunostaining for the protein produced by the INI1 gene is absent in almost all reported cases of rhabdoid tumor cells.

Rhabdoid cells are large cells with an eosinophilic cytoplasm. Ultrastructural examination of rhabdoid cells shows characteristic whorls of intermediate filaments. Intermediate filaments are proteins that contribute to the structure of cells and provide resistance to deformity. Normal cells contain intermediate filaments that are specific for their developmental lineage. Cells of endodermal or ectodermal origin contain cytokeratin filaments. Cells of neuroectodermal origin contain neurofilaments. Cells of mesenchymal origin contain desmin filaments. Rhabdoid cells contain all of these different types of intermediate filaments.

Rhabdoid tumors apparently disobey some of the rules of neoplastic development.

-Rhabdoid tumors are lineage non-specific and can arise from neuroectodermal cells or mesodermal cells. All other tumors of somatic cells (non-germ cells) arise from a single germ lineage.

-Cells within a single rhabdoid tumor seem to have differentiated along several developmental lineages (ectodermal, endodermal, neuroectodermal and mesodermal). This phenomenon is otherwise restricted to totipotent germ cell tumors.

-Cells within a single rhabdoid tumor may include primitive cells indistinguishable from PNET tumors (primitive neuroectodermal tumors). PNET tumors are typically monomorphic tumors.

-Rhabdoid tumors are all associated with a specific phenotypic cell (the rhabdoid cell) regardless of the developmental origin of the tumor (Neuroectoderm or Mesoderm).

-The rhabdoid cells contain several different intermediate filaments. In normal cells, only one type of intermediate filament is found in any single cell, and that filament is specific for the lineage of the cell.

-Almost all tumors arise from cells that resemble an observable normal cell. For example, a squamous cell carcinoma is composed of cells that resemble normal squamous cells biochemically, ultrastructurally, and by light microscopic examination. The rhabdoid cell has no known counterpart in any adult tissue or in any stage of development.

-Rhabdoid tumors seem to be caused by bi-allelic loss of the tumor suppressor gene, INI1. Tumor suppressor genes are normally associated with different types of tumors. INI1 gene loss seems to be exclusively associated with rhabdoid tumors or with rhabdoid tumor cell subpopulations developing within other types of tumors. Currently, there is no other known genetic mutation that produces a specific phenotype akin to the association between INI1 and rhabdoid cells.

-INI1 loss produces rhabdoid tumors in mice. Eight of 125 mice with a germline haploid complement of INI1 (Snf5+/- mice) developed INI1-negative tumors of soft tissue origin and rhabdoid cell morphology. The mouse tumor is morphologically and genetically identical to the human tumor. Despite the phenotypic and genotypic similarities between murine and human rhabdoid tumors, the mouse tumor arises from the branchial arch soft tissue, a tissue of origin not observed in human rhabdoid tumors.

How is this possible? How can a tumor suppressor gene mutation in a non-germ cell produce tumors that arise from tissues of different developmental lineage and contain a common tumor cell that contains intermediate filaments specific for multiple cell lineages?

The INI1 gene codes for a subunit of the SWI/SNF chromatin remodelling complex. The study of SWI/SNF complexes in mammals, flies, and plants suggests that they strongly influence many developmental pathways. This suggests two possible mechanisms for the action of the INI1 tumor suppressor gene in rhabdoid tumors:

-1. The cell of origin of rhabdoid tumors is a primitive, pluripotent somatic stem cell with a lineage position that precedes the development of germ layers.

-2. Loss of INI1 gene expression produces a pluripotent stem cell when it occurs in cells of several different lineages (e.g., neuroectodermal or endodermal or mesodermal cells).

Of these two possibilities, the first seems unlikely because subpopulations of INI1-negative rhabdoid cells occur secondarily within adult tumors, and this would not be expected to occur if the INI1 mutation only produced rhabdoid tumors derived from primitive cells. Also, if there were a very primitive cell (with a lineage that precedes the development of the germ layers), why would the INI1 mutation be the only mutation that could produce tumors of these cells?

The second possibility, if true, might explain the odd biological features of rhabdoid tumors. The idea of mutations in differentiated cells conferring pluripotentiality is not unprecedented. In recent work, Yamanaka and coworkers successfully created pluripotent stem cells from differentiated fibroblasts by introducing (with retroviruses) several chosen genes (Oct3/4, Sox2, c-Myc and Klf4) and subsequently selecting for Fbx15.

See: Okita K, Ichisaka T, Yamanaka S. Generation of germline-competent induced pluripotent stem cells. Nature 448:313-317 2007.

These experiments demonstrated that differentiated cells could be altered to become stem cells, providing researchers with a source of stem cells other than embryos.

It would seem that regardless of the lineage of origin, INI1 biallelic loss creates tumors of a unique phenotype.

As a result, I'm changing the class for rhabdoid tumors within the Neoplasm Classification. It is now a one-of-a-kind tumor in its own class, placed directly under the Neoplasm superclass.

The updated Neoplasm Classification is available as a gzipped XML file at:

http://www.julesberman.info/neoclxml.gz


An early paper on the Neoplasm Classification is available as an open source document.

In June, 2014, my book, entitled Rare Diseases and Orphan Drugs: Keys to Understanding and Treating the Common Diseases was published by Elsevier. The book builds the argument that our best chance of curing the common diseases will come from studying and curing the rare diseases.



I urge you to read more about my book. There's a generous preview of the book at the Google Books site. If you like the book, please request your librarian to purchase a copy of this book for your library or reading room.

- Jules J. Berman, Ph.D., M.D. tags: common disease, orphan disease, orphan drugs, rare disease, disease genetics, medical terminology, rare cancers,

Monday, January 7, 2008

Possessive forms of eponymous neoplasms

Many diseases carry the name of a person. In the past, it was considered proper to specify eponymous diseases using the possessive form: for example, Hodgkin's lymphoma, Kaposi's sarcoma, Warthin's tumor, and so on. More recently, the fashion has been to use the non-possessive surname. During a 1975 meeting to standardize the nomenclature of malformations, a recommendation was made to use "Down Syndrome," not "Down's Syndrome." The argument was made that "The possessive use of an eponym should be discontinued, since the author neither had nor owned the disorder" [1]. The same applies to neoplasms, and the non-possessive form of eponymous tumors should probably be used (e.g., Hodgkin disease or Hodgkin lymphoma, Kaposi sarcoma, Warthin tumor).

More confusing, to my mind, is the inconsistent way we deal with adjectival forms of eponymous terms.

So, it's Brownian motion (not Brown's motion or Brown motion), Abelian groups (not Abel's groups or Abel groups), and Darwinian evolution (not Darwin's evolution or Darwin evolution).

In medicine, it gets even weirder. For example, consider the adjective "Kaposiform". There are at least two "Kaposi" disorders: Kaposi sarcoma (an angiosarcoma most often seen nowadays in immunosuppressive syndromes) and Kaposi varicelliform eruption, a now rare complication of vaccinia superimposed on atopic dermatitis, with high fever and generalized vesicles and papulovesicles. We use the word "kaposiform", but what exactly could this word mean, given the multiple conditions associated with the name "Kaposi"?

Another strange adjective, used by pathologists, is rubriform, meaning red. Red is a color, not a form. So rubriform is an inconsistent metaphor within a single word. Some might prefer the term "reddish," but I doubt that reddish is really a word. It seems to me that the word "red" should suffice when "reddish" or "rubriform" would only confuse.

1. Committee report. "Classification and nomenclature of morphological defects (Discussion)". The Lancet 305:513, 1975.

-Jules Berman

Sunday, January 6, 2008

Concept-match deidentification

I have just uploaded the paper that fully describes the concept-match method for medical record de-identification. This version is modified from the original publication with URL updates that correctly link to currently available supplementary resources.

The properties of the concept-match method are:

It produces an output devoid of phrases that do not map to a reference terminology.

It substitutes synonymous medical terms for the original terms contained in the text, thus making it difficult for someone with access to diagnostic terms found in the original report to match text in the output record (another type of attack on confidentiality).

It maintains the original order of terms in sentences, preserving standard stop words. This integrity allows readers (and computer parsers) of scrubbed text to construct grammatical (logical) relationships between output terms in scrubbed sentences.

It provides an output stripped of nonmedical and extraneous information, in keeping with HIPAA recommendations that covered entities restrict transfers of medical information to the minimum necessary to accomplish its purpose.

It provides the terminology code for each medical term included in the sentence, making it possible to index terms and to relate terms to ancestor and descendant terms listed in biomedical ontologies.

It does its job quickly; high-throughput techniques are required to handle large volumes of data. (A toy sketch of the phrase-matching step appears below.)
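Here is that sketch. It is not the published implementation; the stop word list and the phrase-to-code table are invented for illustration, and a real scrubber would draw its reference phrases from a standard nomenclature such as UMLS. Phrases found in the reference table are replaced by a synonym and its concept code, stop words pass through, and every other word is blocked.

#!/usr/local/bin/perl
# Toy sketch of concept-match-style scrubbing (not the published code).
# Reference phrases map to "code|synonym" pairs; both the table entries
# and the codes are hypothetical.
use strict;
use warnings;
my %stop = map { $_ => 1 } qw(a an and in is of the with);
my %reference = (
    'drug induced hepatitis' => 'C0151786|drug-induced hepatitis',
    'liver'                  => 'C0023884|liver',
);
my $text  = 'patient with drug induced hepatitis of the liver today';
my @words = split /\s+/, lc $text;
my @out;
for (my $i = 0; $i < @words; $i++) {
    my $matched = 0;
    for (my $len = 3; $len >= 1; $len--) {        # longest phrase first
        next if $i + $len > @words;
        my $phrase = join ' ', @words[$i .. $i + $len - 1];
        if (exists $reference{$phrase}) {
            my ($code, $synonym) = split /\|/, $reference{$phrase};
            push @out, "$synonym ($code)";        # synonym plus its code
            $i += $len - 1;                       # skip the matched words
            $matched = 1;
            last;
        }
    }
    next if $matched;
    push @out, $stop{ $words[$i] } ? $words[$i] : '*';   # block the rest
}
print join(' ', @out), "\n";
exit;

The output, "* with drug-induced hepatitis (C0151786) of the liver (C0023884) *", preserves sentence order and stop words while blocking every word that fails to map to the reference list.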

Also distributed with the Concept-Match paper is the JHARCOLL list of text phrases from surgical pathology reports. The JHARCOLL file is freely distributed as a tarred, gzipped file, from:

http://www.julesberman.info/jharcoll.tar.gz

It contains about 568,000 medical phrases that can be used in a variety of informatics projects.

Here is a small excerpt of consecutive phrases, taken directly from the jharcoll file:

drug induced colitis
drug induced damage
drug induced disease
drug induced enteritis
drug induced erosion
drug induced esophagitis
drug induced etiology
drug induced febrile
drug induced forms
drug induced gastric injury
drug induced gastric ulcers
drug induced gastritis
drug induced gingival hypertrophy
drug induced granulomas
drug induced granulomatous
drug induced granulomatous disease
drug induced granulomatous hepatitis
drug induced gut
drug induced gut lesions
drug induced hepatic
drug induced hepatic granulomas
drug induced hepatitis
drug induced hypersensitivity reaction
drug induced immune reaction
drug induced inflammatory disease
drug induced injury
drug induced interstitial
drug induced interstitial lung disease
drug induced interstitial nephritis
drug induced interstitial nephritis clearly
drug induced intestinal inflammatory disease
drug induced intrahepatic cholestasis
drug induced lesion
drug induced lesions
drug induced liver
drug induced liver disease
drug induced liver injury
drug induced lung
drug induced lung disease
drug induced lung injury
drug induced lupus
drug induced lupus erythematosus
drug induced marrow depression
drug induced mucosal injury
drug induced myocarditis
drug induced nephritis
drug induced neutropenia
drug induced pancytopenia
drug induced process
drug induced reaction
drug induced submassive necrosis
drug induced thrombocytopenia
drug induced thrombotic
drug induced ulcer
drug induced ulceration
drug induced ulcers
drug induced vascular disease
drug induced vasculitis
drug induced veno occlusive disease
drug induced vs
drug indused
drug infusion instrument
drug ingestion
drug ingestion aside
drug ingestion history
drug injestion
drug injuries
drug injury
drug intake
drug levels
drug nephrotoxicity
drug nephrotoxicity caused
drug ointment
drug pigmentation
drug presence
drug rash
drug rash versus gvhd
drug reaction
drug reaction given
drug reaction viral exanthem
drug reaction vs
drug reaction vs gvh
drug reactions
drug reactions might
drug recently
drug regimen
drug residue
drug rx
drug rx toxicity
drug rxn
drug stress
drug therapy
drug toxic
drug toxicity

-Jules Berman
My book, Principles of Big Data: Preparing, Sharing, and Analyzing Complex Information was published in 2013 by Morgan Kaufmann.



I urge you to explore my book. Google books has prepared a generous preview of the book contents.

tags: big data, metadata, data preparation, data analytics, data repurposing, datamining, data mining, de-identification, hipaa, medical confidentiality

Deidentification with one-way hash algorithms

A one-way hash is an algorithm that transforms a string into another string in such a way that the original string cannot be calculated by operations on the hash value (hence the term "one-way" hash). Examples of public domain one-way hash algorithms are MD5 and SHA (Secure Hash Algorithm) [1,2]. These differ from encryption protocols, which produce an output that can be decrypted by a second computation on the encrypted string.

The resultant one-way hash values for text strings consist of near-random strings of characters, and the length of the strings (i.e., the strength of the one-way hash) can be made arbitrarily long. Therefore, the name spaces for one-way hashes can be so large that the chance of hash collisions (two different names or identifiers hashing to the same value) is negligible; a 128-bit MD5 hash, for example, has 2^128 possible values. For the fussy among us, protocols can be implemented guaranteeing a dataset free of hash collisions, but such protocols may place restrictions upon the design of the dataset (e.g., precluding the accrual of records to the dataset after a certain moment).

In theory, one-way hashes can be used to anonymize patient records while still permitting researchers to accrue data over time to a specific patient's record. If a patient returns to the hospital and has an additional procedure performed, the record identifier, when hashed, will produce the same hash value held by the original dataset record. The investigator simply adds the data to the "anonymous" dataset record containing the same one-way hash value. Since no identifier in the experimental dataset record can be used to link back to the patient, the requirements for anonymization, as stipulated in the E4 exemption, are satisfied (vide supra).
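As a minimal illustration of this linkage, the following Perl sketch uses the Digest::SHA module; the identifier format and the dataset structure are invented for the example.

#!/usr/local/bin/perl
# Sketch of one-way hash record linkage. The same patient identifier
# always yields the same hash value, so new data can be attached to
# the existing de-identified record without storing the identifier.
use strict;
use warnings;
use Digest::SHA qw(sha256_hex);
my %dataset;                              # hash value => list of record data
sub add_record {
    my ($patient_id, $data) = @_;
    my $key = sha256_hex($patient_id);    # one-way transformation
    push @{ $dataset{$key} }, $data;      # accrue to the same record
}
add_record('Thomas Peterson|1944-03-02', 'biopsy, 2007-11-14');
add_record('Thomas Peterson|1944-03-02', 'resection, 2008-01-03');
foreach my $key (keys %dataset) {
    print "$key: @{ $dataset{$key} }\n";  # the name itself never appears
}
exit;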

One-way hashes have been employed and promoted in France as a way to anonymize patient records. Quantin and Bouzelat have standardized a protocol for coding names with SHA one-way hashes [3]. There is no practical algorithm that can take an SHA hash and determine the name (or the social security number, or the hospital identifier, or any combination of the above) that was used to produce the hash string. In France, the name-hashed files are merged with files from many different hospitals and used in epidemiologic research; the hash codes link patient data across hospitals.

The implementation of one-way hashes carries certain practical problems. An attack on one-way hashed data may take the form of hashing a list of names and looking for matching hash values in the dataset. This can be countered by encrypting the hash, by hashing a secret combination of identifier elements (a keyed hash), or by keeping the hash values private (hidden). Issues also arise from the multiple ways that a person may be identified within a hospital system (Tom Peterson on Monday, Thomas Peterson on Tuesday), all resulting in inconsistent hashes for a single person. Resolving these problems is an interesting area for further research.
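As an illustration of the keyed-hash defense, the sketch below uses the HMAC functions in the Digest::SHA module; the secret key is, of course, invented. Without the institution's key, an attacker who hashes a list of names will find no matches in the dataset.

#!/usr/local/bin/perl
# Sketch of a keyed (HMAC) hash as a defense against list-hashing attacks.
use strict;
use warnings;
use Digest::SHA qw(sha256_hex hmac_sha256_hex);
my $secret = 'institutional secret, never released';
my $name   = 'Thomas Peterson';
my $naked  = sha256_hex($name);               # vulnerable to list-hashing
my $keyed  = hmac_sha256_hex($name, $secret); # reproducible only with the key
print "plain hash: $naked\nkeyed hash: $keyed\n";
exit;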

1. Rivest R. The MD5 Message-Digest Algorithm. Request for Comments 1321.
http://theory.lcs.mit.edu/~rivest/Rivest-MD5.txt

2. World Wide Web Consortium. SHA-1 Digest.
http://www.w3.org/TR/1998/REC-DSig-label/SHA1-1_0

3. Bouzelat H, Quantin C, Dusserre L. Extraction and anonymity protocol of medical file. Proc AMIA Annu Fall Symp 323-327, 1996.

See also my article on one-way hash issues under HIPAA.

-Jules J. Berman
My book, Principles of Big Data: Preparing, Sharing, and Analyzing Complex Information was published in 2013 by Morgan Kaufmann.



I urge you to explore my book. Google books has prepared a generous preview of the book contents.

tags: big data, metadata, data preparation, data analytics, data repurposing, datamining, data mining, anonymization, authentication, confidentiality, medical records

Saturday, January 5, 2008

Zipf law for surgical pathology

In almost every segment of life, a small number of items usually account for the bulk of the observed activities. Though there are millions of authors, a relatively small number of authors account for the bulk of the books sold (think J.K. Rowling). A small number of diseases account for the bulk of deaths (think cardiovascular disease and cancer). A few phyla account for the bulk of the diversity of animals on earth (think arthropods). A few hundred words account for the bulk of all word occurrences in literature (think in, be, a, an, the, are). This phenomenon was observed and described by George Kingsley Zipf, who devised Zipf's law as a mathematical description. Wikipedia has an excellent discussion of Zipf's law.

Zipf's law applies to the diagnoses rendered in a pathology department. I helped write an early paper wherein three years' worth of surgical pathology reports, from a university-associated hospital, were collected and reviewed.

There were 64,921 diagnostic entries (averaging 1.6 SNOMED codes per specimen and 1.4 specimens per patient) that were accounted for by 1,998 different morphologic diagnoses. A mere 21 diagnostic entities accounted for 50% of the code occurrences, and 265 entities accounted for 90%, indicating that the diagnostic efforts of pathology departments are primarily devoted to a small fraction of the many thousands of described pathologic entities.
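The tally behind such a statement is easy to reproduce. The Perl sketch below ranks diagnoses by frequency and reports how many top-ranked entities cover half of all code occurrences; the diagnoses and counts here are invented for illustration.

#!/usr/local/bin/perl
# Zipf-style tally: rank diagnoses by frequency, then report how many
# top-ranked entities account for 50% of all occurrences.
use strict;
use warnings;
my %count = (                    # hypothetical diagnosis counts
    'basal cell carcinoma' => 9000,
    'benign nevus'         => 7000,
    'seborrheic keratosis' => 5000,
    'tubular adenoma'      => 2000,
    'leiomyoma'            => 800,
    'angiosarcoma'         => 12,
);
my $total = 0;
$total += $_ for values %count;
my ($running, $rank) = (0, 0);
foreach my $dx (sort { $count{$b} <=> $count{$a} } keys %count) {
    $running += $count{$dx};     # cumulative occurrences, highest rank first
    $rank++;
    if ($running >= 0.5 * $total) {
        print "$rank entities cover 50% of $total occurrences\n";
        last;
    }
}
exit;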

This paper, published in 1994, is available for review.

-Jules J. Berman
In June, 2014, my book, entitled Rare Diseases and Orphan Drugs: Keys to Understanding and Treating the Common Diseases was published by Elsevier. The book builds the argument that our best chance of curing the common diseases will come from studying and curing the rare diseases.



I urge you to read more about my book. There's a generous preview of the book at the Google Books site.

tags: common disease, orphan disease, orphan drugs, genetics of disease, disease genetics, rules of disease biology, rare disease, pathology, anatomic pathology, medical nomenclature

Friday, January 4, 2008

Google preview of Perl Programming for Medicine and Biology

"Google Books" has chosen my work, Perl Programming for Medicine and Biology (2007), for limited preview.

You can browse the table of contents and read selected excerpts from the book. This comes about as close to an in-store browse as anyone might want, and I'm very grateful that Google provides this service. I have no idea how they choose which books get previewed, but I wish they would do it for every book in print.

-Jules Berman

Medical Abbreviations

A public domain list of over 12,000 medical abbreviations is available at:

www.julesberman.info/abbtwo.htm

The text that discusses the list and describes the different classes of medical abbreviations is found at:

www.julesberman.info/abb1.htm

This document also describes computational approaches to parsing and disambiguating abbreviations found in medical text.
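As a taste of what such an approach involves, here is a toy Perl sketch; the abbreviation table is invented for illustration. Monosemous abbreviations expand directly, while polysemous abbreviations (those with more than one candidate expansion) must be flagged for contextual disambiguation.

#!/usr/local/bin/perl
# Toy sketch of abbreviation expansion in medical text (hypothetical
# table; a real system would use a large curated abbreviation list).
use strict;
use warnings;
my %expansion = (
    'cabg' => ['coronary artery bypass graft'],
    'pe'   => ['pulmonary embolism', 'physical examination'],  # polysemous
);
my $text = 'pt admitted for CABG after PE ruled out';
foreach my $word (split /\s+/, $text) {
    my $abb = lc $word;
    next unless exists $expansion{$abb};
    my @candidates = @{ $expansion{$abb} };
    if (@candidates == 1) {
        print "$word => $candidates[0]\n";                     # monosemous
    } else {
        print "$word => ambiguous: ", join('; ', @candidates), "\n";
    }
}
exit;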

The abstract for the document is shown here:

Berman JJ. Pathology Abbreviated: A Long Review of Short Terms. Archives of Pathology and Laboratory Medicine, 128:347-352, 2004.

Context.—Abbreviations are used frequently in pathology reports and medical records. Efforts to identify and organize free-text concepts must correctly interpret medical abbreviations. During the past decade, the author has collected more than 12000 medical abbreviations, concentrating on terms used or interpreted by pathologists.

Objective.—The purpose of the study is to provide readers with a listing of abbreviations. The listing of abbreviations is reviewed for the purpose of determining the variety of ways that long forms are shortened.

Design.—Abbreviations fell into different classes. These classes seemed amenable to distinct algorithmic approaches to their correct expansions. A discussion of these abbreviation classes was included to assist informaticians who are searching for ways to write software that expands abbreviations found in medical text. Classes were separated by the algorithmic approaches that could be used to map abbreviations to their correct expansions. A Perl implementation was developed to automatically match expansions with Unified Medical Language System concepts.

Measurements.—The abbreviation list contained 12097 terms; 5772 abbreviations had unique expansions. There were 6325 polysemous abbreviation/expansion pairs. The expansions of 8599 abbreviations mapped to Unified Medical Language System concepts. Three hundred twenty-four abbreviations could be confused with unabbreviated words. Two hundred thirteen abbreviations had different expansions depending on whether the American or the British spellings were used. Nine hundred seventy abbreviations ended in the letter “s.”

Results.—There were 6 nonexclusive groups of abbreviations classed by expansion algorithm, as follows: (1) ephemeral; (2) hyponymous; (3) monosemous; (4) polysemous; (5) masqueraders of common words; and (6) fatal (abbreviations whose incorrect expansions could easily result in clinical errors).

Conclusion.—Collecting and classifying abbreviations creates a logical approach to the development of class-specific algorithms designed to expand abbreviations. A large listing of medical abbreviations is placed into the public domain. The most current version is available at http://www.julesberman.info/abbtwo.htm



In June, 2014, my book, entitled Rare Diseases and Orphan Drugs: Keys to Understanding and Treating the Common Diseases was published by Elsevier. The book builds the argument that our best chance of curing the common diseases will come from studying and curing the rare diseases.



I urge you to read more about my book. There's a generous preview of the book at the Google Books site. If you like the book, please request your librarian to purchase a copy of this book for your library or reading room.

- Jules J. Berman, Ph.D., M.D.