Monday, March 3, 2008

Medical Linguistics, Part 4

In the past few blogs, I have been covering some special linguistic aspects of medical terminologies.

Let's summarize:

1. In a large medical nomenclature, singlets (single-word terms) are infrequent. In our example terminology, the Neoplasm Classification, there are about 500 singlets in a classified nomenclature that contains more than 130,000 terms! By the way, the Neoplasm Classification is available for download as a gzipped XML file.

2. All multi-word terms are composed of doublets (two-word terms), and doublets have a more specific meaning than do singlets.

3. Most multi-word terms in medical nomenclatures are composed of doublets that are also found in other terms from the same nomenclature. In the Neoplasm Classification (which contains more than 130,000 different terms), there are fewer than 300 terms that cannot be composed of doublets found in other terms.
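To make the third observation concrete, here is a minimal Python sketch of how one might test whether a term can be composed entirely of doublets found in other terms. The function names and the toy term list are hypothetical, chosen only for illustration; they are not taken from the Neoplasm Classification or from any published code.

```python
def doublets(term):
    """Return the consecutive two-word sequences (doublets) in a term."""
    words = term.lower().split()
    return [" ".join(pair) for pair in zip(words, words[1:])]


def composable_from_others(term, nomenclature):
    """True if every doublet in `term` also occurs in some other term."""
    own = doublets(term)
    if not own:
        # A singlet has no doublets, so there is nothing to compose it from.
        return False
    other = set()
    for candidate in nomenclature:
        if candidate != term:
            other.update(doublets(candidate))
    return all(d in other for d in own)


# Hypothetical toy nomenclature, for illustration only.
toy_terms = [
    "squamous cell carcinoma",
    "basal cell carcinoma",
    "squamous cell papilloma",
    "adenoma",
]

print(doublets("squamous cell carcinoma"))
# ['squamous cell', 'cell carcinoma']
print(composable_from_others("squamous cell carcinoma", toy_terms))  # True
print(composable_from_others("basal cell carcinoma", toy_terms))     # False
```

In this toy list, "basal cell carcinoma" fails the test only because no other term supplies the doublet "basal cell"; in a nomenclature with more than 130,000 terms, such gaps are rare.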

What do these empirical observations imply?

1. If you parse through any medical text, and you encounter a sequence of words composed of doublets [that are found in a nomenclature], the sequence of words is likely to contain terms from the nomenclature.

2. Conversely, if you parse through any medical text, and you encounter a sequence of words composed of doublets [that are NOT found in a nomenclature], the sequence of words is likely NOT to contain terms from the nomenclature.

3. If you parse through any medical text, and you encounter a sequence of words composed of doublets [that are found in a nomenclature], and the sequence of words does not contain terms from the nomenclature, then the sequence of words may contain one or more new terms that can be added to the nomenclature.
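The three implications above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the toy nomenclature, the function names, and the simple word tokenizer are all hypothetical, and the sketch ignores sentence boundaries, so a run of doublets could in principle span two sentences.

```python
import re


def doublet_index(nomenclature):
    """Collect every adjacent word pair that occurs in any term."""
    index = set()
    for term in nomenclature:
        words = term.lower().split()
        index.update(zip(words, words[1:]))
    return index


def candidate_runs(text, index):
    """Return maximal word sequences whose adjacent pairs are all known doublets."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    runs, current = [], []
    for first, second in zip(words, words[1:]):
        if (first, second) in index:
            if not current:
                current = [first]
            current.append(second)
        else:
            if current:
                runs.append(" ".join(current))
            current = []
    if current:
        runs.append(" ".join(current))
    return runs


def classify_runs(text, nomenclature):
    """Label each run by whether it contains a term already in the nomenclature."""
    known = [t.lower() for t in nomenclature]
    results = []
    for run in candidate_runs(text, doublet_index(nomenclature)):
        has_known = any(f" {t} " in f" {run} " for t in known)
        label = "contains existing term" if has_known else "candidate new term"
        results.append((run, label))
    return results


# Hypothetical toy nomenclature and text, for illustration only.
toy_terms = ["basal cell carcinoma", "squamous cell papilloma"]
note = ("Biopsy one showed basal cell carcinoma. "
        "Biopsy two showed squamous cell carcinoma.")

print(classify_runs(note, toy_terms))
# [('basal cell carcinoma', 'contains existing term'),
#  ('squamous cell carcinoma', 'candidate new term')]
```

Words such as "biopsy" and "showed" never form a known doublet, so they fall outside every run. "Squamous cell carcinoma" is built entirely from doublets found in the toy nomenclature, yet it is not itself a listed term, so it surfaces as a candidate for addition.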

In the next blogs, we will explore how to use these ideas to design software that can:

1. Automatically extract terms from a medical corpus (large text file)

2. Automatically code the extracted terms that match existing terms in the nomenclature

3. Automatically remove extraneous words from medical records that may contain patient identifiers or private information related to patients

4. Identify new candidate terms that may need to be added to the nomenclature
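As a rough preview of the third task, here is one possible way a doublet list could drive the removal of extraneous words. This is a hedged sketch, not the implementation that later posts will describe: the function name, the mask character, and the set of approved doublets are all assumptions made for illustration. The idea is to keep a word only if it takes part in at least one approved doublet and to mask everything else.

```python
import re


def scrub(text, approved_doublets, mask="*"):
    """Mask every word that is not part of at least one approved doublet."""
    words = re.findall(r"[A-Za-z0-9]+", text)
    lowered = [w.lower() for w in words]
    keep = [False] * len(words)
    for i in range(len(words) - 1):
        if (lowered[i], lowered[i + 1]) in approved_doublets:
            # Both members of an approved doublet are retained.
            keep[i] = keep[i + 1] = True
    return " ".join(w if k else mask for w, k in zip(words, keep))


# Hypothetical approved doublets, for illustration only.
approved = {
    ("basal", "cell"),
    ("cell", "carcinoma"),
    ("carcinoma", "of"),
    ("of", "the"),
    ("the", "skin"),
}

print(scrub("John Smith has basal cell carcinoma of the skin", approved))
# * * * basal cell carcinoma of the skin
```

Because a name like "John Smith" does not appear as a doublet in the approved list, it is masked without the software needing any list of names. Punctuation is dropped by the simple tokenizer used here.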

If you understand this blog, and if you have a little programming skill, you can write simple, fast, medical software that can perform many of the common computational tasks encountered by biomedical informaticians.

As per usual, most of the topics explored in this blog have been discussed in my book, Biomedical Informatics. Programming skills for biomedical professionals are taught in my books, Perl Programming for Medicine and Biology and Ruby Programming for Medicine and Biology.

- Jules Berman

My book, Principles of Big Data: Preparing, Sharing, and Analyzing Complex Information was published in 2013 by Morgan Kaufmann.



I urge you to read more about my book. Google books has prepared a generous preview of the book contents. If you like the book, please request your librarian to purchase a copy of this book for your library or reading room.
