Monday, March 14, 2016

DATA SIMPLIFICATION: Abbreviations and Acronyms

Over the next few weeks, I will be writing on topics related to my latest book, Data Simplification: Taming Information With Open Source Tools (release date March 17, 2016). I hope I can convince you that this is a book worth reading.

Blog readers can use the discount code: COMP315 for a 30% discount, at checkout.

"A synonym is a word you use when you can't spell the other one." -Baltasar Gracian

People confuse shortening with simplifying; a terrible mistake. In point of fact, next to reifying pronouns, abbreviations are the most vexing cause of complex and meaningless language. Before we tackle the complexities of abbreviations, let's define our terms. An abbreviation is a shortened form of a word or term. An acronym is an abbreviation composed of letters extracted from the words composing a multi-word term. There are two major types of abbreviations: universal/permanent and local/ephemeral. The universal/permanent abbreviations are recognized everywhere and have been used for decades (e.g., USA, DNA, UK). Some of the universal/permanent abbreviations ascend to the status of words whose long-forms have been abandoned. For example, we use laser as a word. Few who use the term know that "laser" is an acronym for "light amplification by stimulated emission of radiation". Local/ephemeral abbreviations are created for terms that are repeated within a particular document or a particular class of documents. Synonyms and plesionyms (i.e., near-synonyms) allow authors to represent a single concept using alternate terms (1).

Abbreviations make textual data complex, for three principal reasons:

1. No rules exist with which abbreviations can be logically expanded to their full-length form.

2. A single abbreviation may mean different things to different individuals, or to the same individual at different times. These are the so-called polysemous abbreviations (See Glossary item, Polysemy). In the medical literature, a single abbreviation may have dozens of different expansions (1).

3. A single term may have multiple different abbreviations. (In medicine, angioimmunoblastic lymphadenopathy can be abbreviated as ABL, AIL, or AIML.)

Some of the worst abbreviations fall into one of the following categories:

Abbreviations that are neither acronyms nor shortened forms of expansions. For example, the short form of "diagnosis" is "dx", although no "x" is contained therein. The same applies to the "x" in "tx", the abbreviation for "therapy", but not the "X" in "TX" that stands for Texas. For that matter, the short form of "times" is an "x", relating to the notation for the multiplication operator. Roman numerals I, V, X, L and M are abbreviations for words assigned to numbers, but they are not characters included in the expanded words (e.g., there is no "I" in "one"). EKG is the abbreviation for electrocardiogram, a word totally bereft of any "K". The "K" comes from the German orthography. There is no letter "q" in subcutaneous, but the abbreviation for the word is sometimes "subq"; never "subc". What form of alchemy converts ethanol to its common abbreviation, "EtOH"?

Mixed-form abbreviations. In medical lingo "DSV" represents the Dermatome of the fifth (V) Sacral nerve. Here a preposition, an article, and a noun (of, the, nerve) have all been unceremoniously excluded from the abbreviation; the order of the acronym components has been transposed (dermatome sacral fifth); an ordinal has been changed to a cardinal (fifth changed to five); and the cardinal has been shortened to its Roman numeral equivalent (V).

Prepositions and articles arbitrarily retained in an acronym. When creating an abbreviation, should we retain or abandon prepositions? Many acronyms exclude prepositions and articles. USA is the acronym for United States of America; the "of" is ignored. DOB (Date Of Birth) remembers the "of".

Single expansions with multiple abbreviations. Just as abbreviations can map to many different expansions, the reverse can occur. For instance, high-grade squamous intraepithelial lesion can be abbreviated as HGSIL or HSIL. Xanthogranulomatous pyelonephritis can be abbreviated as xgp or xgpn.

Recursive abbreviations. The following example illustrates the horror of recursive abbreviations. The term SMETE is the abbreviation for the phrase "science, math, engineering, and technology education". NSDL is a real-life abbreviation, for "National SMETE digital Library community". To fully expand the term (i.e., to provide meaning to the abbreviation), you must recursively expand the embedded abbreviation, to produce "National science, math, engineering, and technology education digital Library community."
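For readers who like to see such things in code, here is a minimal Python sketch of recursive expansion. The lookup table and the function name `expand` are my own inventions for illustration; the only data in the table are the two abbreviations discussed above. The `seen` argument is a guard, assumed here, against abbreviations that contain themselves.

```python
import re

# Hypothetical lookup table; the two entries are the examples from the text.
EXPANSIONS = {
    "SMETE": "science, math, engineering, and technology education",
    "NSDL": "National SMETE digital Library community",
}

def expand(term, table, seen=()):
    """Expand `term`, then expand any abbreviations embedded in the result."""
    if term not in table:
        return term
    if term in seen:
        # Self-referential abbreviation: stop here rather than loop forever.
        return term

    def replace(match):
        return expand(match.group(0), table, seen + (term,))

    # Examine each run of letters and expand it if it is a known abbreviation.
    return re.sub(r"[A-Za-z]+", replace, table[term])

print(expand("NSDL", EXPANSIONS))
# prints: National science, math, engineering, and technology education digital Library community
```

The lookup is case-sensitive, so an ordinary lowercase word will not be mistaken for an uppercase abbreviation that happens to share its spelling.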

Stupid or purposefully unhelpful abbreviations. The term GNU (Gnu is not UNIX) is a recursive acronym. Fully expanded, this acronym is of infinite length. Although the N and the U expand to words ("Not Unix"), the letter G is simply inscrutable. Another example of an inexplicable abbreviation is PT-LPD (post-transplantation lymphoproliferative disorders). The only logical location for a hyphen would be smack between the letters p and t. Is the hyphen situated between the T and the L for the sole purpose of irritating us?

Abbreviations that change from place to place. Americans sometimes forget that most English-speaking countries use British English. For example, an esophagus in New York is an oesophagus in London. Hence TOF makes no sense as an abbreviation of tracheo-esophageal fistula here in the U.S., but this abbreviation makes perfect sense to physicians in England, where a patient may have a Tracheo-Oesophageal Fistula. The term GERD (representing the phrase gastroesophageal reflux disease) makes perfect sense to Americans, but it must be confusing in Britain, where the esophagus is not an organ.

Abbreviations masquerading as words. Our greatest vitriol is reserved for abbreviations that look just like common words. Some of the worst offenders come from the medical lexicon: axillary node dissection (AND), acute lymphocytic leukemia (ALL), Bornholm Eye Disease (BED), and Expired Air Resuscitation (EAR). Such acronyms aggravate the computational task of confidently translating common words. Acronyms commonly appear as uppercase strings, but a review of a text corpus of medical notes has shown that words could not be consistently distinguished from homonymous word-acronyms (2).
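One modest defense is to find the word-acronyms in your abbreviation list ahead of time, so that software can treat them with suspicion instead of expanding them blindly. The sketch below is only an illustration: the acronym table holds the examples from this post plus DNA, and the tiny `COMMON_WORDS` set is a hypothetical stand-in for a real English word list.

```python
# Hypothetical acronym table, built from the examples in the text.
ACRONYMS = {
    "AND": "axillary node dissection",
    "ALL": "acute lymphocytic leukemia",
    "BED": "Bornholm Eye Disease",
    "EAR": "Expired Air Resuscitation",
    "DNA": "deoxyribonucleic acid",
}

# Hypothetical stand-in for a real English word list.
COMMON_WORDS = {"and", "all", "bed", "ear", "the", "of"}

def word_collisions(acronyms, words):
    """Return the acronyms whose letters also spell a common word."""
    return sorted(a for a in acronyms if a.lower() in words)

print(word_collisions(ACRONYMS, COMMON_WORDS))
# prints: ['ALL', 'AND', 'BED', 'EAR']
```

Flagging these collisions does not resolve them. It only tells you which tokens cannot be trusted on the basis of their spelling.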

Fatal abbreviations. Fatal abbreviations are those which can kill individuals if they are interpreted incorrectly. They all seem to originate in the world of medicine:

MVR, which can be expanded to any of: mitral valve regurgitation, mitral valve repair, or mitral valve replacement;

LLL, which can be expanded to any of: left lower lid, left lower lip, or left lower lung;

DOA, which can be expanded to any of: dead on arrival, date of arrival, date of admission, or drug of abuse.

Is a fear of abbreviations rational, or does this fear emanate from an overactive imagination? In 2004, the Joint Commission on Accreditation of Healthcare Organizations, a stalwart institution not known to be squeamish, issued an announcement that, henceforth, a list of specified abbreviations should be excluded from medical records.

Examples of forbidden abbreviations are:

IU (International Unit), mistaken as IV (intravenous) or 10 (ten).

Q.D., Q.O.D. (Latin abbreviation for once daily and every other day), mistaken for each other.

Trailing zero (X.0 mg) or a lack of a leading zero (.X mg), in which cases the decimal point may be missed. Never write a zero by itself after a decimal point (write X mg), and always use a zero before a decimal point (write 0.X mg).

MS, MSO4, and MgSO4, all of which can be confused with one another; each may be read as morphine sulfate or magnesium sulfate. Write "morphine sulfate" or "magnesium sulfate."
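The trailing-zero and leading-zero rules in the list above are mechanical enough to check by machine. Here is a rough Python sketch, using two regular expressions of my own devising; the function name `dose_warnings` and the sample order strings are hypothetical, and a production checker would need to cope with far messier text.

```python
import re

# A decimal point followed only by zeros, e.g. "5.0" or "5.00" (trailing zero).
TRAILING_ZERO = re.compile(r"(?<![\d.])\d+\.0+(?!\d)")
# A decimal point with no digit in front of it, e.g. ".5" (missing leading zero).
NO_LEADING_ZERO = re.compile(r"(?<!\d)\.\d+")

def dose_warnings(order):
    """Return a list of warnings for risky decimal notation in an order string."""
    warnings = []
    for match in TRAILING_ZERO.finditer(order):
        warnings.append(f"trailing zero: {match.group(0)}")
    for match in NO_LEADING_ZERO.finditer(order):
        warnings.append(f"missing leading zero: {match.group(0)}")
    return warnings

print(dose_warnings("drug A 5.0 mg"))    # prints: ['trailing zero: 5.0']
print(dose_warnings("drug B .5 mg"))     # prints: ['missing leading zero: .5']
print(dose_warnings("drug C 0.5 mg"))    # prints: []
```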

Abbreviations on the hospital watch list were:

µg (for microgram), mistaken for mg (milligrams), resulting in a 1000-fold overdose.

h.s., which can mean either half-strength or the Latin abbreviation for bedtime, or may be mistaken for q.h.s., taken every hour. All can result in a dosing error.

T.I.W. (for three times a week), mistaken for three times a day or twice weekly, resulting in an overdose.

The list of abbreviations that can kill, in the medical setting, is quite lengthy. Fatal abbreviations probably devolved through imprecise, inconsistent, or idiosyncratic uses of an abbreviation, by the busy hospital staff who enter notes and orders into patient charts. For any knowledge domain, the potentially fatal abbreviations are the most important to catch.

Nobody has ever found an accurate way of disambiguating and translating abbreviations (1). There are, however, a few simple suggestions, based on years of exasperating experience, that might save you time and energy.

1. Disallow the use of abbreviations, whenever possible. Abbreviations never enhance the value of information. The time saved by using an abbreviation is far exceeded by the time spent attempting to deduce its correct meaning.

2. When writing software applications that find and expand abbreviations, the output should list every known expansion of the abbreviation. For example, the abbreviation "ppp", appearing in a medical report, should have all these expansions inserted into the text, as annotations: pancreatic polypeptide, palatopharyngoplasty, palmoplantar pustulosis, pentose phosphate pathway, platelet poor plasma, primary proliferative polycythaemia, primary proliferative polycythemia. Leave it up to the knowledge domain experts to disambiguate the results.
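The second suggestion might be sketched in Python along the following lines. The table `KNOWN_EXPANSIONS`, the function name `annotate`, and the bracketed annotation format are assumptions made for illustration; the "ppp" expansions are the ones listed above. Note that the script makes no attempt to choose among the expansions.

```python
import re

# Hypothetical lookup table holding the "ppp" expansions from the text.
KNOWN_EXPANSIONS = {
    "ppp": [
        "pancreatic polypeptide",
        "palatopharyngoplasty",
        "palmoplantar pustulosis",
        "pentose phosphate pathway",
        "platelet poor plasma",
        "primary proliferative polycythaemia",
        "primary proliferative polycythemia",
    ],
}

def annotate(text, table):
    """Append every known expansion, in brackets, after each abbreviation found."""
    def add_note(match):
        token = match.group(0)
        expansions = table.get(token.lower())
        if not expansions:
            return token  # not a known abbreviation; leave the word alone
        return f"{token} [{' | '.join(expansions)}]"

    return re.sub(r"[A-Za-z]+", add_note, text)

print(annotate("Findings consistent with PPP.", KNOWN_EXPANSIONS))
```

The original abbreviation stays in the text, with its candidate expansions set beside it, so that the domain expert sees exactly what the author wrote.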

- Jules Berman (copyrighted material)

key words: computer science, data analysis, data repurposing, data simplification, simplifying data, abbreviations, acronyms, complexity, jules j berman


[1] Berman JJ. Pathology abbreviated: a long review of short terms. Arch Pathol Lab Med 128:347-352, 2004.

[2] Nadkarni P, Chen R, Brandt C. UMLS concept indexing for production databases. JAMIA 8:80-91, 2001.
