Friday, February 22, 2008

Ruby, Perl, and Python medical autocoding

The other day, I created a very large web page that included 20,000 PubMed abstracts and the autocoded output for each.

The web page was apparently too large for some people to view, so I cut it down to show about 10,000 autocoded samples, along with the code for the autocoder in Ruby, Perl and Python. It is all available at:

For anyone unfamiliar with medical autocoding, a medical autocoder is a software program capable of parsing large collections of medical records (e.g. radiology reports, surgical pathology reports, autopsy reports, admission notes, discharge notes, operating room notes, medical administrative emails, memoranda, manuscripts, etc.) and capturing the medical concepts contained in the text.

The term "autocoding" should be distinguished from "computer-assisted manual coding." Health care workers may use a software enhancement of their Hospital Information Systems to code a section of text as they enter reports into the computer system. Typically, candidate terms and term codes [from a medical nomenclature] are displayed on the same screen as the entered report. The person entering text is often given the option of editing the proffered codes. This process should not be confused with "autocoding" and is not equivalent to the fully automatic and large-scale coding required by biomedical informaticians.

Finding all the concepts in a corpus of text is a necessary, early step in any data mining effort. The autocoded terms can be used in three ways: individually, as index terms for the document; record by record, to produce a concept "signature" that is highly specific for each report; or collectively, to compare the frequency of terms within a record against the frequency of terms in the aggregate document.
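To make those three uses concrete, here is a minimal Python sketch. It is not taken from the posted scripts; the record names, the `X000n` concept codes, and the `relative_frequency` helper are all hypothetical placeholders, and the input is assumed to be autocoder output that has already been reduced to a list of codes per record.

```python
from collections import Counter

# Hypothetical autocoder output: record id -> concept codes found in that record.
# The codes are made-up placeholders, not codes from any real nomenclature.
coded_records = {
    "report_1": ["X0002", "X0004", "X0002"],
    "report_2": ["X0003"],
    "report_3": ["X0002", "X0003"],
}

# Use 1: index terms. Map each code to the records that contain it.
index = {}
for record_id, codes in coded_records.items():
    for code in set(codes):
        index.setdefault(code, []).append(record_id)

# Use 2: concept "signature". The sorted set of distinct codes in each record.
signatures = {
    record_id: tuple(sorted(set(codes)))
    for record_id, codes in coded_records.items()
}

# Use 3: frequency within a record versus frequency in the whole collection.
corpus_counts = Counter(code for codes in coded_records.values() for code in codes)
corpus_total = sum(corpus_counts.values())

def relative_frequency(record_id, code):
    """Return (frequency within the record, frequency in the aggregate)."""
    codes = coded_records[record_id]
    within = codes.count(code) / len(codes)
    overall = corpus_counts[code] / corpus_total
    return within, overall
```

A code that is much more frequent within one record than in the aggregate is a reasonable candidate for what makes that record distinctive.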

The simple autocoder provided (in Perl, Python, and Ruby) is fast (about 100 kilobytes of text per second) and nearly perfect in its output. You can check the accuracy of the extracted and coded neoplasm terms yourself. With a minor modification, the scripts will accommodate any nomenclature in which terms are assigned concept-code numbers.
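For readers who want a feel for how a dictionary-based autocoder can work, here is a small Python sketch. It is an illustration under stated assumptions, not the posted script: the `NOMENCLATURE` dictionary, its `X000n` codes, and the `autocode` function are hypothetical, and the matching strategy (look up every run of one to three words in each sentence) is just one simple approach.

```python
import re

# Hypothetical nomenclature: term -> concept code.
# The codes are made-up placeholders; swap in any term-to-code mapping.
NOMENCLATURE = {
    "adenocarcinoma": "X0001",
    "adenocarcinoma of lung": "X0002",
    "renal cell carcinoma": "X0003",
    "lymphoma": "X0004",
}

# Longest term, in words, so we never test phrases longer than necessary.
MAX_WORDS = max(len(term.split()) for term in NOMENCLATURE)

def autocode(text):
    """Return (term, code) pairs for every nomenclature term found in text."""
    hits = []
    # Split on sentence-level punctuation so phrases never span sentences.
    for sentence in re.split(r"[.;:?!\n]+", text.lower()):
        words = re.findall(r"[a-z0-9]+", sentence)
        for start in range(len(words)):
            for length in range(1, MAX_WORDS + 1):
                if start + length > len(words):
                    break
                phrase = " ".join(words[start:start + length])
                code = NOMENCLATURE.get(phrase)
                if code is not None:
                    hits.append((phrase, code))
    return hits

sample = "Biopsy showed adenocarcinoma of lung. No evidence of lymphoma."
coded = autocode(sample)
```

Because the nomenclature is an ordinary dictionary, pointing the sketch at a different vocabulary only means loading a different term-to-code mapping. Note that this naive version reports both "adenocarcinoma" and "adenocarcinoma of lung" for the sample sentence, and it does not handle negation ("no evidence of lymphoma" is still coded).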

- Jules Berman

key words: medical software, nomenclature, medical datamining, perl programming, ruby programming, python programming, biomedical informatics, medical informatics, autocoding, autocoder, medical autocoding