Thursday, February 14, 2008

The importance of having a FAST medical autocoder

In the past few blog posts, I've been writing about medical autocoders.

The medical informatics literature has lots of descriptions of medical autocoders, but most of these descriptions fail to include the speed of the autocoders.

It's been my experience that most published autocoders work at about 500 bytes per second. If a surgical pathology report is about 1000 bytes long (which I expect is roughly typical), a report would take about 2 seconds to autocode.

The autocoder that I wrote about in the past few blog posts works at about 100 kilobytes per second (i.e., 1 megabyte of text in ten seconds). For code simplicity, I didn't use the doublet method for this autocoder. Had I done so, I think it would have coded about 1 megabyte of text per second in Perl or Ruby (and even faster in Python).
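The arithmetic behind these figures is easy to check. The following is a minimal sketch, not the code of any autocoder discussed here: the function names `seconds_to_code` and `measure_throughput` are hypothetical, and the rates are simply the rough figures quoted above rather than measured benchmarks.

```python
import time

# Rough rates quoted in the text, in bytes per second (not measured benchmarks).
RATES = {
    "typical published autocoder": 500,
    "autocoder from the earlier posts": 100_000,
    "estimated doublet-method autocoder": 1_000_000,
}


def seconds_to_code(num_bytes, bytes_per_second):
    """Return the time in seconds needed to autocode num_bytes at a given rate."""
    return num_bytes / bytes_per_second


def measure_throughput(autocoder, text):
    """Time a single call of autocoder(text) and return bytes per second.

    `autocoder` is any callable that accepts a string; it stands in for
    whatever autocoding function you want to benchmark.
    """
    size = len(text.encode("utf-8"))
    start = time.perf_counter()
    autocoder(text)
    elapsed = time.perf_counter() - start
    return size / elapsed if elapsed > 0 else float("inf")


if __name__ == "__main__":
    report_bytes = 1000  # approximate length of one surgical pathology report
    for name, rate in RATES.items():
        print(f"{name}: {seconds_to_code(report_bytes, rate):.3f} s per report")
```

For a 1000-byte report, the three rates work out to 2 seconds, 0.01 seconds, and 0.001 seconds respectively. Reporting a number from something like `measure_throughput` alongside a published autocoder would supply the speed figure that most descriptions leave out.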

Why is it important to have a fast autocoder? Why can't you load your parser with a big file and let it run in the background, taking as long as it takes to finish?

There are three reasons why you absolutely must have a fast autocoder. I discuss these in my book, Biomedical Informatics, and I thought I'd address the issue in this blog as well.

1. Medical files today are large. It is not unusual for a large medical center to generate a terabyte of data each week. A slow autocoder could never keep up with the volume of medical information that is produced each day.
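To put numbers on this, here is a rough back-of-the-envelope sketch. It assumes, purely for the sake of the arithmetic, that the entire terabyte is text that must be autocoded and that a terabyte is 10^12 bytes; the function names are hypothetical.

```python
SECONDS_PER_WEEK = 7 * 24 * 60 * 60  # 604,800 seconds
TERABYTE = 10**12  # decimal terabyte, in bytes (an assumption)


def weeks_to_code(num_bytes, bytes_per_second):
    """Return how many weeks one autocoder needs to process num_bytes."""
    return num_bytes / bytes_per_second / SECONDS_PER_WEEK


def required_rate(num_bytes_per_week):
    """Return the sustained bytes-per-second rate needed to keep pace."""
    return num_bytes_per_week / SECONDS_PER_WEEK


if __name__ == "__main__":
    for rate in (500, 100_000, 1_000_000):
        weeks = weeks_to_code(TERABYTE, rate)
        print(f"{rate:>9} bytes/s: {weeks:,.1f} weeks per terabyte")
    print(f"rate needed to keep pace: {required_rate(TERABYTE):,.0f} bytes/s")
```

Under these assumptions, an autocoder running at 500 bytes per second would need roughly 3,300 weeks (more than 60 years) to code a single week's output, and keeping pace would take a sustained rate of about 1.65 megabytes per second.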

2. Autocoders, and the nomenclatures they draw terms from, need to be modified to accommodate unexpected oddities in the text that they parse (particularly formatting oddities and the inclusion of idiosyncratic language to express medical terms). The cycle of running a program, reviewing output, making modifications in software or nomenclatures, and repeating the whole process many times cannot be undertaken if you need to wait a week for your autocoding software to parse your text.

3. Autocoding is as much about re-coding as it is about the initial process of providing nomenclature codes.

You need to re-code (supply a new set of nomenclature codes for terms in your medical text) whenever you want to change from one nomenclature to another.

You need to re-code whenever you introduce a new version of a nomenclature.

You need to re-code whenever you want to use a new coding algorithm (e.g., parsimonious coding versus comprehensive coding, or linking codes to a particular extracted portion of a report).

You need to re-code whenever you add legacy data to your laboratory information systems.

You need to re-code whenever you merge different medical datasets (especially medical datasets that have been coded with different medical nomenclatures).

All of this re-coding adds to the data burden placed on a medical autocoder.
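A toy sketch may help show why re-coding costs as much as the original coding pass. This is not the autocoder from the earlier posts: it uses a naive substring match, and the nomenclatures, terms, and codes are all invented for illustration.

```python
def autocode(text, nomenclature):
    """Return sorted (term, code) pairs for nomenclature terms found in text.

    Naive substring matching, used here only to keep the example short.
    """
    lowered = text.lower()
    return sorted(
        (term, code) for term, code in nomenclature.items() if term in lowered
    )


def recode(reports, nomenclature):
    """Re-run the autocoder over every report with the given nomenclature."""
    return [autocode(report, nomenclature) for report in reports]


# Invented terms and codes; these do not come from any real nomenclature.
NOMENCLATURE_V1 = {"adenocarcinoma": "X001", "colon": "X002"}
NOMENCLATURE_V2 = {
    "adenocarcinoma": "Y100",
    "colon": "Y200",
    "lymph node": "Y300",
}

REPORTS = [
    "Adenocarcinoma of the colon.",
    "Lymph node, negative for tumor.",
]

if __name__ == "__main__":
    print(recode(REPORTS, NOMENCLATURE_V1))
    print(recode(REPORTS, NOMENCLATURE_V2))
```

Switching from the first nomenclature to the second means parsing every report again from scratch. Nothing from the first pass can be reused, because the new version may contain terms (here, "lymph node") that the old version never looked for.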

It has been my personal observation that computational tasks that take a long time (more than a few seconds) tend to be put on the back burner. Many of the same observations apply to medical deidentification software. Smart informaticians understand that program execution speed is always very important.

- Jules Berman
My book, Principles of Big Data: Preparing, Sharing, and Analyzing Complex Information was published in 2013 by Morgan Kaufmann.

tags: big data, metadata, data preparation, data analytics, data repurposing, datamining, data mining, autocoding, data scrubbing, medical autocoding, medical nomenclature, medical software