Monday, February 11, 2008

Fast, accurate medical autocoding with Ruby

In the field of biomedical informatics, it is often necessary to extract medical terms from text and attach a nomenclature concept code to the extracted term. By doing so, concepts of interest contained in text can be retrieved regardless of the choice of words used to describe a concept. For example, hepatocellular carcinoma, liver cell cancer, liver cancer, and hcc might all be given the same code number in a neoplasm nomenclature. Documents using any of these terms can be collected and merged if all of the terms are annotated with the same concept code.
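
The idea can be sketched in a few lines of Ruby. This is only an illustration: the code value C9999999, the document titles, and the method names are invented placeholders, not entries from any real nomenclature, and the plain substring test is cruder than a real autocoder would use.

```ruby
# Hypothetical synonym table: every term maps to one placeholder concept code.
# "C9999999" is an invented value, not a real nomenclature code.
SYNONYMS = {
  "hepatocellular carcinoma" => "C9999999",
  "liver cell cancer"        => "C9999999",
  "liver cancer"             => "C9999999",
  "hcc"                      => "C9999999"
}

# Invented example documents, each using a different wording.
DOCUMENTS = [
  "a case of hepatocellular carcinoma",
  "screening for liver cell cancer",
  "hcc in patients with cirrhosis",
  "benign cyst of the kidney"
]

# Return the distinct codes of all listed terms that occur in the text.
def codes_for(text, synonyms)
  synonyms.select { |term, _code| text.include?(term) }.values.uniq
end

# Group documents by concept code, so differently worded documents merge.
def group_by_code(documents, synonyms)
  groups = Hash.new { |hash, code| hash[code] = [] }
  documents.each do |doc|
    codes_for(doc, synonyms).each { |code| groups[code] << doc }
  end
  groups
end

group_by_code(DOCUMENTS, SYNONYMS).each do |code, docs|
  puts "#{code}: #{docs.size} documents"
end
```

The three liver cancer documents end up under the same code even though they share no common wording, and the kidney document is left out.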

Many people think that it is difficult to write autocoding software (software that can parse text, extract terms, and code the terms).

Many people think that it is impossible to write fast autocoding software. People accept autocoder speeds that code a typical pathology report at a rate of 1 report (about 1 kilobyte) per second.

Both of these notions are false. A superb autocoder can be written in a few dozen lines of Ruby code. This short coder is fast, coding 20,000 citations in about 21 seconds (on a 2.8 GHz desktop CPU with 512 megabytes of RAM). This is a rate of about 100 kilobytes per second. A faster (but more complex) coder has been written by the author using the doublet method.

The output of the coder is virtually perfect. I have prepared a web file that permits anyone to browse through 20,000 abstract titles and inspect the named neoplasms in the abstract text that were coded by the Ruby script. It is available at:

Doubters can autocode the same list of abstract titles to determine if they can write an autocoder that is as simple, fast or accurate as this short Ruby autocoder.

Here is my Ruby script. For more information about using Ruby for autocoding and for many other biomedical projects, you may want to read my Ruby book.

As with all of my scripts, the following disclaimer applies. This script is provided by its creator, Jules J. Berman, "as is", without warranty of any kind, expressed or implied, including but not limited to the warranties of merchantability, fitness for a particular purpose and noninfringement. In no event shall the author or copyright holder be liable for any claim, damages or other liability, whether in an action of contract, tort or otherwise, arising from, out of or in connection with the software or the use or other dealings in the software.

Note that this script requires two external files. One is neocl.xml, the neoplasm classification in XML format, available for download as a gzipped file from:

It also requires tumorabs.txt, available at:

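# Build a hash that maps each neoplasm term to its concept code.
text = File.open("neocl.xml", "r")
literalhash = Hash.new
text.each do |line|
  next if (line !~ /\"(C[0-9]{7})\"/)
  line =~ /\"(C[0-9]{7})\"/
  code = $1
  line =~ /\"\> ?(.+) ?\<\//
  phrase = $1
  if (phrase =~ /[a-z]/)
    literalhash[phrase] = code
    #puts phrase
  end
end
text.close
puts "Neoplasm code hash has been created. Autocoding will start now"
absfile = File.open("tumorabs.txt", "r")
outfile = File.open("tumorabs.out", "w")
absfile.each do |sentence|
  sentence.chomp!
  # Fold simple plurals so that they match the singular terms in the hash.
  sentence.gsub!(/omas/, "oma")
  sentence.gsub!(/tumo[u]?rs/, "tumor")
  outfile.puts "\nAbstract title..." + sentence.capitalize + "."
  cum_array = []   # not used in the lines shown here
  sentence_array = sentence.split
  length = sentence_array.size
  # For each starting word, test every phrase that begins there,
  # then drop that word and repeat with the rest of the title.
  length.times do
    (1..sentence_array.size).each do |place_length|
      phrase = sentence_array.slice(0,place_length).join(" ")
      if literalhash.has_key?(phrase)
        outfile.puts "Neoplasm term..." + phrase.capitalize + " " + literalhash[phrase]
      end
    end
    sentence_array.shift
  end
end
absfile.close
outfile.close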
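
The first half of the script pulls a code and a term out of each line of neocl.xml with two regular expressions. The sketch below applies equivalent patterns to a single line. The sample line is hypothetical: its element and attribute names are assumptions shaped only to satisfy the patterns, and the real file may be laid out differently. The method name extract_entry and the call to strip are also additions for illustration.

```ruby
# Hypothetical line, shaped only to satisfy the script's two patterns.
# The element and attribute names are assumptions, not taken from neocl.xml.
SAMPLE_LINE = '<name code="C7188000">burkitt lymphoma</name>'

# Return [phrase, code] for a line, or nil when the line has no usable entry.
def extract_entry(line)
  code = line[/"(C[0-9]{7})"/, 1]      # a quoted C followed by seven digits
  return nil if code.nil?
  phrase = line[/"> ?(.+) ?<\//, 1]    # the text between "> and </
  return nil if phrase.nil? || phrase !~ /[a-z]/
  [phrase.strip, code]                 # strip is an addition; the script keeps the raw match
end

p extract_entry(SAMPLE_LINE)    # prints ["burkitt lymphoma", "C7188000"]
p extract_entry("<neoplasms>")  # prints nil
```

Lines without a quoted code are skipped, which is how the script ignores the structural lines of the XML file and keeps only the term entries.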

Here is an example citation, followed by autocoded neoplasm terms.

Abstract title...Obstructive jaundice associated burkitt lymphoma mimicking pancreatic carcinoma.
Neoplasm term...Jaundice C0000000
Neoplasm term...Burkitt lymphoma C7188000
Neoplasm term...Lymphoma C7065000
Neoplasm term...Pancreatic carcinoma C3850000
Neoplasm term...Carcinoma C0000000

Note that only names of neoplasms are coded. Neoplasm-related terms (not strictly the name of any particular neoplasm) are captured with the general code, C0000000.
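
To trace how the example above comes about, here is a self-contained sketch of the matching step alone. It is not the script itself: the term table holds only the five terms and codes shown in the sample output, the method name autocode is invented, and the title is assumed to be lowercase already, with no plural folding applied.

```ruby
# Term-to-code table limited to the five entries shown in the sample output.
TERMS = {
  "jaundice"             => "C0000000",
  "burkitt lymphoma"     => "C7188000",
  "lymphoma"             => "C7065000",
  "pancreatic carcinoma" => "C3850000",
  "carcinoma"            => "C0000000"
}

# Return [phrase, code] pairs for every word sequence found in the table.
# For each start position, every phrase length is tried, shortest first.
def autocode(title, terms)
  words = title.split
  matches = []
  words.each_index do |start|
    (1..words.size - start).each do |length|
      phrase = words[start, length].join(" ")
      matches << [phrase, terms[phrase]] if terms.key?(phrase)
    end
  end
  matches
end

title = "obstructive jaundice associated burkitt lymphoma mimicking pancreatic carcinoma"
autocode(title, TERMS).each do |phrase, code|
  puts "Neoplasm term..." + phrase.capitalize + " " + code
end
```

Every run of consecutive words is looked up, so both "burkitt lymphoma" and the shorter "lymphoma" inside it are reported. The number of lookups grows with the square of the title length, which stays small for short titles because each lookup is a single hash access.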

- Jules Berman
