This is a continuation of yesterday's blog.
One of the many challenges in the field of machine translation is that expressions (multi-word terms) convey ideas that transcend the meanings of the individual words in the expression. Consider the following sentence:
"The ciliary body produces aqueous humor."
The example sentence has an unambiguous meaning to anatomists, but each word in the sentence can have many different meanings. "Ciliary" is a common medical word, and usually refers to the action of cilia. Cilia are found throughout the respiratory and gastrointestinal tracts and play an important role in moving particulate matter. The word "body" almost always refers to the human body. The term "ciliary body" should (but does not) refer to the action of cilia that move human bodies from place to place. The word "aqueous" always refers to water. The word "humor" relates to something being funny. The term "aqueous humor" should (but does not) relate to something that is funny by virtue of its use of water (as in squirting someone in the face with a trick flower). Actually, "ciliary body" and "aqueous humor" are each examples of medical doublets whose meanings are specific and contextually constant (i.e., they always mean one thing). Furthermore, the meanings of the doublets cannot be reliably determined from the individual words that constitute the doublet, because the individual words have several different meanings. Basically, you either know the correct meaning of the doublet, or you don't.
Any sentence can be examined by parsing it into an array of intercalated doublets:
"The ciliary, ciliary body, body produces, produces aqueous, aqueous humor."
The important concepts in the sentence are contained in two doublets (ciliary body and aqueous humor). A nomenclature containing these doublets would allow us to extract and index these two medical concepts. A nomenclature consisting of single words might miss the contextual meaning of the doublets.
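Here is a minimal Python sketch of the idea, not the implementation from my book. The function names (doublets, extract_terms) and the two-entry nomenclature are made up for illustration, and the tokenizer simply lowercases the text and keeps runs of letters and digits.

```python
import re

def doublets(sentence):
    """Return the overlapping (intercalated) word pairs of a sentence, in order."""
    # Simplifying assumption: lowercase everything and treat any run of
    # letters or digits as a word, discarding punctuation.
    words = re.findall(r"[a-z0-9]+", sentence.lower())
    return [f"{first} {second}" for first, second in zip(words, words[1:])]

def extract_terms(sentence, nomenclature):
    """Keep only the doublets that appear in the nomenclature."""
    return [pair for pair in doublets(sentence) if pair in nomenclature]

# Hypothetical two-term nomenclature, used only for this example.
toy_nomenclature = {"ciliary body", "aqueous humor"}

sentence = "The ciliary body produces aqueous humor."
print(doublets(sentence))
print(extract_terms(sentence, toy_nomenclature))
```

The first print lists all five doublets from the sentence. The second keeps only "ciliary body" and "aqueous humor," the two doublets that match the nomenclature.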
What if the term were larger than a doublet? Consider the tumor "orbital alveolar rhabdomyosarcoma." The individual words can be misleading. This orbital tumor is not from outer space, and the alveolar tumor is not from the lung. The three-word term describes a sarcoma arising in the orbit of the eye, with a morphology characterized by tiny spaces similar in size and shape to those found in glands (alveoli). The term "orbital alveolar rhabdomyosarcoma" can be parsed as "orbital alveolar, alveolar rhabdomyosarcoma." Why is this any better than parsing the term into individual words, as in "orbital, alveolar, rhabdomyosarcoma"? The doublets, unlike the single words, are highly specific terms that are unlikely to occur in association with more than a few specific concepts.
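To make the specificity argument concrete, here is a small Python sketch that indexes a handful of terms twice, once by single word and once by doublet. The four-term list is a toy sample I picked for illustration, not an excerpt from any real nomenclature, and build_index is a hypothetical helper.

```python
import re
from collections import defaultdict

# Toy term list, chosen only to illustrate the point.
TERMS = [
    "orbital alveolar rhabdomyosarcoma",
    "alveolar soft part sarcoma",
    "orbital cellulitis",
    "alveolar bone loss",
]

def build_index(terms, use_doublets):
    """Map each key (single word or doublet) to the set of terms containing it."""
    index = defaultdict(set)
    for term in terms:
        words = re.findall(r"[a-z0-9]+", term.lower())
        if use_doublets:
            keys = [f"{a} {b}" for a, b in zip(words, words[1:])]
        else:
            keys = words
        for key in keys:
            index[key].add(term)
    return index

word_index = build_index(TERMS, use_doublets=False)
doublet_index = build_index(TERMS, use_doublets=True)

# The single word "alveolar" is shared by three of the four terms...
print(sorted(word_index["alveolar"]))
# ...while each doublet from the tumor name points to one term only.
print(sorted(doublet_index["orbital alveolar"]))
print(sorted(doublet_index["alveolar rhabdomyosarcoma"]))
```

In this toy list, looking up "alveolar" or "orbital" alone returns several unrelated terms, while either doublet leads straight to the rhabdomyosarcoma.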
Very few medical terms are single words. In the Neoplasm classification, there are over 135,000 terms and only about 500 are single words. The doublet method uses the multi-word feature of medical terms to extract meaning from text.
This topic is covered in detail in my book, Biomedical Informatics.
To be continued.
- Jules Berman
My book, Principles of Big Data: Preparing, Sharing, and Analyzing Complex Information was published in 2013 by Morgan Kaufmann.
I urge you to read more about my book. Google books has prepared a generous preview of the book contents. If you like the book, please request your librarian to purchase a copy of this book for your library or reading room.