The de-identification of medical text is a good use of the doublet list. To share medical records (usually for research purposes), it is often important to remove all of the identifiers that could link a patient to the record.
A variety of medical text scrubbers are now available for this purpose. Most require users to develop identifier lists for their site (lists of patient names, doctor names, etc.), run very slowly (typically about one record per second), and do not remove anywhere close to all of the identifiers.
Not so for the doublet method. A doublet scrubber parses through any text, matching doublets (pairs of consecutive words) from the text against an external identifier-free doublet list, preserving all matching doublets from the text, and replacing each non-matching word with an asterisk. If your list of doublets contains no identifiers, the scrubbed output should be perfectly de-identified. Though perfection can never be guaranteed, I have never encountered any "missed" identifiers in a text that was parsed under these conditions. A public domain list of doublets is available, but I cannot guarantee that the list is identifier-free or that it is the best list for your purposes. Feel free to modify the list, add to the list, or create your own list of identifier-free doublets.
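The scrubbing step described above can be sketched in a few lines. This is a minimal illustration in Python, not the Perl script used for the timings in this post; the function names (`load_doublets`, `scrub`), the lowercasing of the text, and the rule for splitting text into words are all assumptions made for the example.

```python
import re

def load_doublets(lines):
    """Build a set of lowercase two-word phrases from an iterable of lines.

    Assumes one doublet per line; lines that are not exactly two words are skipped.
    """
    return {" ".join(line.lower().split()) for line in lines if len(line.split()) == 2}

def scrub(text, doublets):
    """Keep every word that belongs to an approved doublet; mask the rest with '*'."""
    # Assumption: words are runs of letters, digits, and apostrophes; punctuation is dropped.
    words = re.findall(r"[a-z0-9']+", text.lower())
    keep = [False] * len(words)
    # Slide over each pair of consecutive words and mark both words of any approved pair.
    for i in range(len(words) - 1):
        if f"{words[i]} {words[i + 1]}" in doublets:
            keep[i] = keep[i + 1] = True
    return " ".join(word if kept else "*" for word, kept in zip(words, keep))

# Hypothetical three-entry doublet list, for illustration only.
approved = load_doublets(["adenocarcinoma of", "of the", "the colon"])
print(scrub("John Smith has adenocarcinoma of the colon.", approved))
# prints: * * * adenocarcinoma of the colon
```

Because the approved doublets are held in a set, each lookup takes roughly constant time, so the work grows with the length of the text rather than with the size of the doublet list.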
The doublet method is described in Ruby Programming for Medicine and Biology.
In the output file, the list of authors for each citation is put on a line, immediately followed by its scrubbed version on the next line. Then the title of the article is put on the next line, followed by the scrubbed version of the title. This pattern is repeated for all of the 15,000+ citations.
The doublet scrubber is small (just a few dozen lines of code) and fast. It took approximately 2 seconds to parse the 15,000 citations using a Perl script with access to a list of about 200,000 identifier-free doublets. I used my home computer (2.8 GHz, 512 MByte RAM). This is a scrubbing rate of about 1 MByte per second. At this speed, a 1 GByte file could be parsed in about 17 minutes, and a 1 Terabyte file in roughly 12 days. Large hospitals produce about 1 Terabyte of data each week, so one or two modest desktop computers running this scrubber can, for now, "keep up" with the vast load of data produced by many hospitals.
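The extrapolation from the measured rate to larger files is simple division, sketched below. The function name `scrub_time_seconds` is made up for this example, and decimal units (1 GByte = 10^9 bytes, 1 Terabyte = 10^12 bytes) are an assumption. Since the measured rate is itself approximate, the results should be read as order-of-magnitude estimates.

```python
def scrub_time_seconds(size_bytes, rate_bytes_per_second=1_000_000):
    """Estimated scrubbing time at a constant throughput (default: 1 MByte per second)."""
    return size_bytes / rate_bytes_per_second

GBYTE = 10**9    # assumption: decimal gigabyte
TBYTE = 10**12   # assumption: decimal terabyte

minutes_per_gbyte = scrub_time_seconds(GBYTE) / 60
days_per_tbyte = scrub_time_seconds(TBYTE) / 86_400  # 86,400 seconds per day

print(f"1 GByte: {minutes_per_gbyte:.1f} minutes")   # prints: 1 GByte: 16.7 minutes
print(f"1 Terabyte: {days_per_tbyte:.1f} days")      # prints: 1 Terabyte: 11.6 days
```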
The only limitation that I have found with the doublet scrubber is that it scrubs too much, blocking every word that does not belong to a doublet found in the external doublet list. You can be the judge: the output file attached here can be used to assess the effectiveness of the doublet method of text scrubbing.
-Jules Berman