Thursday, February 28, 2008

A fast (combined) medical autocoder and scrubber

In today's blog, I discuss a newly uploaded public domain file that contains the combined autocoded and scrubbed output for 95,260 PubMed citations (computed in under a minute).

In the field of biomedical informatics, the term "scrubbing" refers to removing patient identifiers from confidential medical records. The term "autocoding" refers to extracting medical terms from text and assigning each term its concept code from a nomenclature.

I have prepared a public domain corpus of 95,260 PubMed citations that have been autocoded using the Neoplasm Classification. The Neoplasm Classification is available as a gzipped XML file. All of the named neoplasms and all of the general, non-specific terms for neoplasms (such as the word "tumor") have been automatically extracted from the text.
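The extraction step can be illustrated with a minimal sketch. It is written in Python rather than the Perl used for the actual autocoder, and it is not that program. It assumes a tiny hypothetical nomenclature (the terms are real words, but the codes are made up for illustration) and matches the longest phrase first, so that "squamous cell carcinoma" is not coded merely as "carcinoma".

```python
import re

# Hypothetical mini-nomenclature: term -> made-up concept code.
# The real Neoplasm Classification is far larger and uses its own codes.
NOMENCLATURE = {
    "tumor": "X0000001",
    "adenocarcinoma": "X0000002",
    "squamous cell carcinoma": "X0000003",
    "carcinoma": "X0000004",
}
MAX_WORDS = max(len(term.split()) for term in NOMENCLATURE)


def autocode(text):
    """Return (term, code) pairs found in text, preferring the longest match."""
    words = re.findall(r"[a-z]+", text.lower())
    found = []
    i = 0
    while i < len(words):
        # Try the longest candidate phrase first, then shorter ones.
        for n in range(min(MAX_WORDS, len(words) - i), 0, -1):
            phrase = " ".join(words[i:i + n])
            if phrase in NOMENCLATURE:
                found.append((phrase, NOMENCLATURE[phrase]))
                i += n  # skip past the matched phrase
                break
        else:
            i += 1  # no term starts at this word
    return found


if __name__ == "__main__":
    print(autocode("Squamous cell carcinoma of the lung; a second tumor was found."))
```

Because each candidate phrase is checked with a dictionary lookup, the cost per word does not grow with the size of the nomenclature, which is one plausible reason an autocoder of this kind can be fast.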

In addition, all of the citations have been de-identified. Words that might be identifiers are replaced by an asterisk.
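This post does not say how the scrubber decides which words might be identifiers, so the following Python sketch rests on an assumption: every word is treated as a possible identifier unless it appears on a list of approved words. The `SAFE_WORDS` list here is hypothetical and far too small for real use. A practical list would hold the nomenclature terms plus common English words.

```python
import re

# Hypothetical allow-list of words considered safe to keep.
SAFE_WORDS = {
    "a", "the", "of", "was", "in", "with", "diagnosed",
    "patient", "tumor", "adenocarcinoma", "colon",
}


def scrub(text):
    """Replace every word or number not on the allow-list with an asterisk."""
    def keep_or_mask(match):
        word = match.group(0)
        return word if word.lower() in SAFE_WORDS else "*"

    # Punctuation and spacing are left untouched; only word tokens are tested.
    return re.sub(r"[A-Za-z0-9]+", keep_or_mask, text)


if __name__ == "__main__":
    print(scrub("John Smith was diagnosed with adenocarcinoma of the colon in 1998."))
```

Under this assumption, the example sentence comes out as "* * was diagnosed with adenocarcinoma of the colon in *." The names and the year are masked because they are not on the list, not because the code recognizes them as names or dates.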

On a web site, I have listed the first thousand entries in the file, just so that you can get an idea of what the output looks like.

If you are curious about autocoding or medical record scrubbing (also called de-identification), you should visit two of my other web sites, which discuss these topics in greater detail.

Autocoding (topic)


Medical Data Scrubbing (topic)

The automatic coder and scrubber consists of a few dozen lines of Perl code. The file that is coded and scrubbed contains 95,260 PubMed citations and has a length of over 10 Megabytes. Autocoding and scrubbing took under a minute on a modest 2.8 GHz desktop computer with 512 Megabytes of RAM. This is a rate of about 200 Kilobytes per second.
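The rate quoted above can be checked with a little arithmetic. The Python sketch below assumes decimal units (1 Megabyte = 1,000 Kilobytes). The 50-second run time is an illustrative guess, since the exact timing is given only as "under a minute".

```python
def throughput_kb_per_s(size_mb, seconds):
    """Processing rate in Kilobytes per second, using 1 MB = 1000 KB."""
    return size_mb * 1000 / seconds


if __name__ == "__main__":
    # A full minute on exactly 10 MB gives a lower bound on the rate.
    print(round(throughput_kb_per_s(10, 60)))
    # A hypothetical 50-second run on 10 MB reaches 200 KB per second.
    print(throughput_kb_per_s(10, 50))
```

A file of just over 10 Megabytes processed in a little under a minute therefore lands between roughly 170 and 200 Kilobytes per second, consistent with the figure of about 200.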

The entire input file and the entire output file are available from my website as gzipped text files:

Input text file (10 Megabytes expanded)


Output autocoded and deidentified file (25+ Megabytes expanded)

They are public domain documents.

You can check for yourself the accuracy of the scrubber and autocoder. You will find that virtually no names of neoplasms were missed and that virtually no identifiers were left in the scrubbed text.

Medical autocoding and medical record scrubbing are described in great detail in my two recently published books:

Perl Programming for Medicine and Biology


Ruby Programming for Medicine and Biology

-Jules J. Berman

