Specified Life: medical nomenclature

Sunday, September 21, 2014

Lymphadenopathy: a misnomer

Medical nomenclature contains numerous examples of outdated, but widely used terminology.

The term "lymphadenopathy", meaning lymph node disease, is a case in point. In former times, lymph nodes (as they are known now) were known as lymph glands. It was believed that the lymph fluid circulating in the lymph vessels was produced by the lymph nodes. Organs that produce chemicals that are circulated to other tissues are referred to as glands (e.g., endocrine glands, exocrine glands). Hence the term "lymph gland". A disease of the lymph gland was termed "lymphadenopathy", from lymph + adenos (Greek for gland) + pathei (Greek for disease).

[Figure: Derivation of lymph fluid. Source: National Cancer Institute, public domain]

The term for a neoplasm of a lymph node was "lymphadenoma"

The term for inflammation of a lymph node was "lymphadenitis"

Nearly everything about lymph node pathology was tied to the ill-conceived notion that a lymph node is a type of gland.

We now know that lymph is not produced by the glandular activity of lymph nodes. Lymph is interstitial fluid (i.e., fluid between tissue cells) that is absorbed into lymph vessels. Lymph fluid is somewhat milky because it contains white cells, sloughed from lymph nodes, but the fluid comes from tissue interstitium and its composition is akin to blood plasma.

Modern pathologists have dropped the "adeno" in "lymphadenoma" and replaced it with the less confusing term, "lymphoma".

Regrettably, the terms "lymphadenopathy" and "lymphadenitis" persist into modern usage.

- Jules J. Berman, Ph.D., M.D. tags: lymph node, lymphoid, lymphedema, lymphatics, lymphatic vessels, common disease, orphan disease, orphan drugs, rare disease, subsets of disease, disease genetics, logophile, medical terminology, medical nomenclature, medical dictionary

In June, 2014, my book, entitled Rare Diseases and Orphan Drugs: Keys to Understanding and Treating the Common Diseases was published by Elsevier. The book builds the argument that our best chance of curing the common diseases will come from studying and curing the rare diseases.

Wednesday, January 28, 2009

Update of Neoplasm Classification is now available

I'm interrupting my series of blogs on bimodal cancer age distributions to announce the release of the most recent version of the Developmental Lineage Classification and Taxonomy of Neoplasms.

The current classification contains 122,698 terms classified under 6,083 neoplasm concepts (types of neoplasms). It also contains a large number of unclassified neoplasm terms as addendum items. It is, far and away, the world's largest neoplasm nomenclature.

The classification is available in XML, RDF and flat-file formats. Here is the preface text distributed with each formatted version:

"This file was prepared by Jules J. Berman. The first version of this file was created November 15, 2003. The current version was created on January 27, 2009.

Copyright © 2003-2009 Jules J. Berman

Permission is granted to copy, distribute and/or modify this document under the terms of the GNU Free Documentation License, Version 1.2 or any later version published by the Free Software Foundation; with no Invariant Sections, no Front-Cover Texts, and no Back-Cover Texts. A copy of the license is available.

The neoclxml file is provided "as is", without warranty of any kind, expressed or implied, including but not limited to the warranties of merchantability, fitness for a particular purpose and noninfringement. In no event shall the author or copyright holder be liable for any claim, damages or other liability, whether in an action of contract, tort or otherwise, arising from, out of or in connection with the software or the use or other dealings in the software.

An explanation of the classification can be found in the following two publications, which should be cited in any publication or work that may result from any use of this file.

Berman JJ. Tumor classification: molecular analysis meets Aristotle. BMC Cancer 4:8, 2004.

Berman JJ. Neoplasms: Principles of Development and Diversity. Jones and Bartlett Publishers, Sudbury, MA, 2009.

In the Neoplasm Classification, all classified names of neoplasms are coded with a "C" followed by a 7 digit number other than 0000000.

For example, "C9168000" = rectal signet ring adenocarcinoma

In addition to classified terms, there are three groups of unclassified terms, provided as special items that follow the list of classified terms in this file.

"C0000000"
"S" followed by 7 digits
"ST" followed by 7 digits

This list of unclassified terms coded as "C0000000" consists of general cancer terms that do not specify any particular neoplasm; overly specific terms that provide so-called pre-coordinated annotations related to terms contained elsewhere in the Classification; and valid terms that have not been added (yet) to the list of classified neoplasm terms.

Examples of overly specific terms are:

squamous carcinoma of the nasal vestibule, gastric non-hodgkin lymphoma of mucosa-associated lymphoid tissue, primary primitive neuroectodermal tumor of the kidney

The terms that are coded "S" followed by 7 digits are inherited syndromes that have a neoplastic component (i.e., the occasional or frequent appearance of neoplasms in the syndrome).

The terms that are coded "ST" followed by 7 digits are staging terms used by oncologists.

The classification is meant for informatics projects that use computer parsing techniques. Programmers should simply insert statements that filter the unclassified terms included in the file."

Additional information may be available from the author's web site:
http://www.julesberman.info/devclass.htm

The Neoplasm Classification is available as a zipped XML file at:
http://www.julesberman.info/neoclxml.zip

The Neoplasm Classification is available as a zipped flat file at:
http://www.julesberman.info/neoself.zip

The Neoplasm Classification is available as a zipped RDF file at:
http://www.julesberman.info/neordf.zip

© 2009 Jules Berman

key words: medical nomenclature, classification, rdf, xml, ontology, data mining, science

Science is not a collection of facts. Science is what facts teach us; what we can learn about our universe, and ourselves, by deductive thinking. From observations of the night sky, made without the aid of telescopes, we can deduce that the universe is expanding, that the universe is not infinitely old, and why black holes exist. Without resorting to experimentation or mathematical analysis, we can deduce that gravity is a curvature in space-time, that the particles that compose light have no mass, that there is a theoretical limit to the number of different elements in the universe, and that the earth is billions of years old. Likewise, simple observations on animals tell us much about the migration of continents, the evolutionary relationships among classes of animals, why the nuclei of cells contain our genetic material, why certain animals are long-lived, why the gestation period of humans is 9 months, and why some diseases are rare and other diseases are common. In “Armchair Science”, the reader is confronted with 129 scientific mysteries, in cosmology, particle physics, chemistry, biology, and medicine. Beginning with simple observations, step-by-step analyses guide the reader toward solutions that are sometimes startling, and always entertaining. “Armchair Science” is written for general readers who are curious about science, and who want to sharpen their deductive skills.

Tuesday, September 2, 2008

Sample pages for formatted versions of Neoplasm Classification

As announced in a recent blog, the newest version of the free, open source, Developmental Lineage Classification and Taxonomy of Neoplasms has been released. This Classification contains about 135,000 different names of neoplasms classified under about 6,000 neoplasm concepts. It is the largest neoplasm nomenclature in existence.

The classification can be downloaded from: http://www.julesberman.info/devclass.htm

The Classification is available in several different formats: XML, RDF and flat-file.

I have prepared three web pages that display short excerpts of each document style, so that you can quickly assess the different formats.

The excerpts are available at:

Flat-file: http://www.julesberman.info/plaintxt.htm

XML: http://www.julesberman.info/plainxml.htm

RDF (Resource Description Framework): http://www.julesberman.info/plainrdf.htm

A search engine that permits look-ups of neoplasm names, retrieving synonyms and related terms from the Developmental Classification, is available at: http://www.julesberman.info/neoget.htm

-Jules Berman

key words: ontology, classification, medical terminology, medical nomenclature, neoplasms, tumors, tumours, pathology

Monday, August 25, 2008

Update of Medical Abbreviation Web Page

Today, I updated my medical abbreviation web page.

A journal article describes the web page and provides a discourse on medical abbreviations:

Berman JJ. Pathology Abbreviated: A Long Review of Short Terms. Archives of Pathology and Laboratory Medicine, 128:347-352, 2004.

The medical abbreviation page contains about 12,000 medical abbreviations. It is open source and is distributed under a GNU license.

A sampling of the page:


aa = adriamycin
aa = african american
aa = alcohol abuse
aa = alcoholics anonymous
aa = alopecia areata
aa = amino acid
aa = amyloid protein A
aa = aortic aneurysm
aa = aortic arch
aa = aplastic anemia
aa = ara c
aa = arachidonic acid
aa = ascending aorta
aaa = abdominal aortic aneurysm
aaa = acquired aplastic anemia
aaa = acute apical abscess
aaa = aromatic amino acid
aaf = acetylaminofluorene
aag = alpha 1 acid glycoprotein
aah = atypical adenomatous hyperplasia
aai = acute alcohol intoxication
aall = anterior axillary line
aami = age associated memory impairment
aamtase = aklanonic acid methyltransferase
aaox3 = awake alert and oriented to date place & person
aark = automated anesthesia record keeper
aas = aarskog scott syndrome
aas = aortic arch syndrome
aas = atlantoaxial subluxation
aat = aachen aphasia test
aat = alpha 1 antitrypsin
aat = alpha antitrypsin
aau = acute anterior uveitis
aav = aids associated virus
ab = abdominal
ab = abortion
ab = antibody
ab = asthmatic bronchitis
aba = abscissic acid
abatc = azidobenzamidotaurocholate
abc = aneurysmal bone cyst
abc = apnea bradycardia cyanosis
abc = aspiration biopsy cytology
abd = abdomen
abd = abductor
abdom = abdomen
abdv = bleomycin doxorubicin dtic vinblastine
abe = acute bacterial endocarditis
abf = aortobifemoral graft
abg = aortic bifurcation graft
abg = aortobifemoral graft
abg = arterial blood gas
abi = arterial pressure index
abk = aphakic bullous keratopathy
abl = abetalipoproteinaemia
abl = abetalipoproteinemia
abl = african burkitt's lymphoma
abl = angioblastic lymphadenopathy
ablc = amphotericin b lipid complex
abm = alveolar basement membrane antibody
abmt = autologous bone marrow transplant
abn = abnormal
abnl = abnormal
abo = a system of classifying blood groups by a or b antibody
abo = abortion
abo = antibody
abo = bleo mtx vcr
abo hdn = abo hemolytic disease of the newborn
abp = arterial blood pressure
abpa = asthmatic bronchopulmonary aspergillosis
abpb = abductor pollicis brevis
abpl = abductor pollicis longus
abr = auditory brainstem response
abs = abdominal muscles
abs = abdominals
abs = absent
abs = acute brain syndrome
abs = affect balance scale
abs = anorexic behavior scales
abt = autologous blood transfusion
abu = asymptomatic bacteriuria
abv = actinomycin d, bleomycin, vincristine
abv = adriamycin, bleomycin, vinblastine
abv = bleo dox vbl
abvd = bleomycin dacarbazine doxorubicin vincristine
abx = antibiotics
ac = acute
ac = air conditioning
ac = air conduction
ac = alternating current
ac = amniocentesis
ac = ante cibum
ac = anterior chamber
ac = ascending colon
ac = assist control
ac/a = accommodation convergence/accommodation
aca = adenocarcinoma
aca = anterior cerebral artery
aca = anti cardiolipin antibody
acad = academy
acbe = air contrast barium enema
acc = acceleration
acc = accelerator
acc = accident
acc = accommodation
acc = adrenocortical carcinoma
acc = agenesis of corpus callosum
acc = alveolar cell carcinoma
acc = aplasia cutis congenita
acca = adenylate cyclase constitutive activator
acca = adrenal cortical carcinoma
acca = adrenocortical carcinoma
accom = accommodation
accpn = agenesis of the corpus callosum with peripheral neuropathy
acd = acid citrate dextrose
acd = actinomycin d
acd = adult celiac disease
acd = allergic contact dermatitis
acd = anemia of chronic disease
acd = arteriosclerotic coronary disease
acd = automatic cardiac defibrillator procedure
acd = corneal dystrophy
ace = acetylcholinesterase
ace = angiotensin converting enzyme
ace = cyclophosphamide doxorubicin
ace = dipeptidyl peptidase a
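
Because each line of the page has the form "abbreviation = expansion", and the same abbreviation can appear on many lines, the list maps naturally onto a dictionary of lists. The following is a minimal Python sketch, not the code behind the web page; the function name is hypothetical and the four sample lines are copied from the sampling above.

```python
from collections import defaultdict


def load_abbreviations(lines):
    """Map each abbreviation to the list of its expansions, in file order."""
    table = defaultdict(list)
    for line in lines:
        # Split on the first " = " only; lines without the separator are skipped.
        abbreviation, separator, expansion = line.strip().partition(" = ")
        if separator and expansion:
            table[abbreviation].append(expansion)
    return dict(table)


# Four lines taken from the sampling of the abbreviation page.
sample = [
    "abg = aortic bifurcation graft",
    "abg = aortobifemoral graft",
    "abg = arterial blood gas",
    "abx = antibiotics",
]

table = load_abbreviations(sample)
# Abbreviations with more than one expansion need context to be resolved.
ambiguous = sorted(a for a, expansions in table.items() if len(expansions) > 1)

print(table["abg"])
print(ambiguous)  # -> ['abg']
```

Keeping every expansion, rather than overwriting earlier ones, is what makes it possible to list the abbreviations that cannot be expanded without context.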

I urge you to read more about my book. There's a generous preview of the book at the Google Books site. If you like the book, please request your librarian to purchase a copy of this book for your library or reading room.
-© 2009 Jules J. Berman, Ph.D., M.D.
tags: common disease, orphan disease, orphan drugs, rare disease, subsets of disease, disease genetics, genetics of complex disease, genetics of common diseases, medical informatics, medical nomenclature, medical terminology, medical abbreviations

Thursday, March 13, 2008

Updated files for the Neoplasm Classification now available

Updated versions of the Neoplasm Classification are now available:

The Neoplasm Classification contains over 135,000 classified names of neoplasms in a biological hierarchy based on developmental lineage of the tumor. It is the largest and most comprehensive neoplasm nomenclature in existence. It is available as a simple XML file, an RDF ontology, or a plain flat-file.

These files were prepared by Jules J. Berman. The first version of this file was created November 15, 2003. The modifications were created on March 13, 2008.

The following applies to the distributed documents:

Copyright (c) 2007-2008 Jules J. Berman. Permission is granted to copy, distribute and/or modify this document under the terms of the GNU Free Documentation License, Version 1.2 or any later version published by the Free Software Foundation; with no Invariant Sections, no Front-Cover Texts, and no Back-Cover Texts. A copy of the license is available at:
http://en.wikipedia.org/wiki/Wikipedia:Text_of_the_GNU_Free_Documentation_License

The files are provided "as is", without warranty of any kind, expressed or implied, including but not limited to the warranties of merchantability, fitness for a particular purpose and noninfringement. In no event shall the author or copyright holder be liable for any claim, damages or other liability, whether in an action of contract, tort or otherwise, arising from, out of or in connection with the software or the use or other dealings in the software.

An explanation of the classification can be found in the following
two publications, which should be cited in any publication or work that may
result from any use of this file.

Berman JJ. Tumor classification: molecular analysis meets Aristotle.
BMC Cancer 4:8, 2004.

Berman JJ. Biomedical Informatics. Jones and Bartlett Publishers,
Sudbury, MA, 2007.

In the Neoplasm Classification, all classified names of neoplasms are coded with a "C" followed by a 7 digit number other than 0000000 or 0000001.

For example, "C9168000" = rectal signet ring adenocarcinoma

In addition to classified terms, there are four groups of unclassified terms, provided as special items that follow the list of classified terms in this file.

"C0000000"
"C0000001"
"S" followed by 7 digits
"ST" followed by 7 digits

This list of unclassified terms coded as "C0000000" consists of general cancer terms that do not specify any particular neoplasm; overly specific terms that provide so-called pre-coordinated annotations related to terms contained elsewhere in the Classification; and valid terms that have not been added (yet) to the list of classified neoplasm terms.

Examples of non-specific cancer-related terms are:

borderline tumor
mucinous tumor
blast crisis
preinvasive carcinoma
dysplasia

Examples of overly specific terms are:

squamous carcinoma of the nasal vestibule
gastric non-hodgkin lymphoma of mucosa-associated lymphoid tissue
primary primitive neuroectodermal tumor of the kidney

The terms that are coded with "C0000001" are precancers and related conditions that have not yet been added to the list of classified terms.

The terms that are coded "S" followed by 7 digits are inherited syndromes that have a neoplastic component (i.e., the occasional or frequent appearance of neoplasms in the syndrome).

The terms that are coded "ST" followed by 7 digits are staging terms used by oncologists.

The classification is intended for informatics projects that use computer parsing techniques. Programmers should simply insert statements that filter the unclassified terms included in the file.
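
One way to write such a filter is to test each code against the patterns described above. The sketch below is in Python and assumes that the file has already been parsed into (code, term) pairs; the parsing step, the function names, and the "C0000001", "S" and "ST" sample entries are illustrative assumptions rather than entries taken from the classification.

```python
import re

# Code patterns as described in the preface text.
CLASSIFIED_CODE = re.compile(r"C\d{7}")
SYNDROME_CODE = re.compile(r"S\d{7}")
STAGING_CODE = re.compile(r"ST\d{7}")
UNCLASSIFIED_C_CODES = {"C0000000", "C0000001"}


def code_group(code):
    """Name the group that a code belongs to."""
    if code in UNCLASSIFIED_C_CODES:
        return "unclassified"
    if CLASSIFIED_CODE.fullmatch(code):
        return "classified"
    if SYNDROME_CODE.fullmatch(code):
        return "syndrome"
    if STAGING_CODE.fullmatch(code):
        return "staging"
    return "unknown"


def classified_only(pairs):
    """Keep only the (code, term) pairs that carry a classified neoplasm code."""
    return [(code, term) for code, term in pairs if code_group(code) == "classified"]


pairs = [
    ("C9168000", "rectal signet ring adenocarcinoma"),
    ("C0000000", "borderline tumor"),
    ("C0000001", "hypothetical precancer term"),      # made-up entry
    ("S0000123", "hypothetical inherited syndrome"),  # made-up code and term
    ("ST0000045", "hypothetical staging term"),       # made-up code and term
]

print(classified_only(pairs))
# -> [('C9168000', 'rectal signet ring adenocarcinoma')]
```

The two reserved "C" codes are checked before the general "C" pattern, because they would otherwise match it.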

Additional information may be available from the author's web site:
http://www.julesberman.info/

The gzipped version of the RDF file (under 1 Megabyte)
http://www.julesberman.info/neorxml.gz

The flat file version, listing each term followed by its lineage (gzipped file).
http://www.julesberman.info/neoself.gz

The plain old XML version, with no RDF semantics (gzipped file).
http://www.julesberman.info/neoclxml.gz

- Jules Berman

Monday, February 25, 2008

Ambiguous disease names

One of the problems in medical autocoding is the existence of polysemous disease names (disease names that correspond to different diseases depending on the context in which they appear).

Here are some examples:

Cervical carcinoma. Is this a carcinoma involving the neck (the structure between your head and your shoulders), or does this refer to a tumor of the uterine cervix?

Medullary carcinoma. This term can refer to medullary carcinoma of breast, medullary carcinoma of the thyroid gland, or medullary carcinoma of the adrenal medulla. These are all distinct, and different, tumors.

Paget's disease. This term can refer to a non-neoplastic disease of bones or a neoplastic process that most often involves the nipple overlying a breast cancer.

Because a disease name can refer to different biological diseases, the task of automatically mapping a disease name to a single disease concept is unlikely to be something that can be done with 100% accuracy.
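
The point can be made concrete with a small lookup table in which one name maps to several candidate concepts. This Python sketch uses the three examples above plus one unambiguous name for contrast; the table layout, the concept labels, and the function name are assumptions made for illustration, not part of any published autocoder.

```python
# Each disease name maps to every concept it might denote (illustrative labels).
CANDIDATES = {
    "cervical carcinoma": [
        "carcinoma of the neck",
        "carcinoma of the uterine cervix",
    ],
    "medullary carcinoma": [
        "medullary carcinoma of breast",
        "medullary carcinoma of the thyroid gland",
        "medullary carcinoma of the adrenal medulla",
    ],
    "paget's disease": [
        "non-neoplastic disease of bones",
        "neoplastic process involving the nipple",
    ],
    # An unambiguous name, included for contrast.
    "rectal signet ring adenocarcinoma": ["rectal signet ring adenocarcinoma"],
}


def map_term(term):
    """Return (concept, candidates).

    concept is None when the name is unknown, or when the name alone
    does not determine a single concept.
    """
    candidates = CANDIDATES.get(term.lower(), [])
    concept = candidates[0] if len(candidates) == 1 else None
    return concept, candidates


print(map_term("Medullary carcinoma"))
print(map_term("rectal signet ring adenocarcinoma"))
```

Returning the full candidate list, instead of guessing, leaves the choice to whatever context-sensitive step (or human reviewer) comes next.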

The oddities of medical nomenclatures, and the impact on data mining, are discussed at length in my book, Biomedical Informatics.

- Jules J. Berman, Ph.D., M.D.
tags: common disease, orphan disease, orphan drugs, rare disease, subsets of disease, disease genetics, genetics of complex disease, genetics of common diseases, medical nomenclatures, homonyms, medical autocoding, text retrieval, data mining

Sunday, February 17, 2008

Dangerous medical abbreviations

Almost all abbreviations have multiple different expansions. More often than not, it is easy for a human to disambiguate the meaning of an abbreviation that has alternate expansions. For example, you can distinguish AKA ("above knee amputation") from AKA ("also known as"). The context of a sentence determines the meaning.

However, there are many abbreviations that cannot easily be disambiguated, even by experts in a knowledge domain. These abbreviations sometimes arise from what I call "term-drift," wherein another, very similar term, with the same abbreviation, is mistakenly used, and where this misuse gains a foothold in medical culture.

Abbreviations that cannot always be disambiguated are particularly dangerous and are a potential source of medical errors. Here are some examples:

1. ABG aortic bifurcation graft, or aortobifemoral graft

2. AHA acquired hemolytic anemia, or autoimmune hemolytic anemia

3. ASCVD arteriosclerotic cardiovascular disease, or arteriosclerotic cerebrovascular disease

4. CHD congenital heart disease, or congestive heart disease, or coronary heart disease

5. DOA date of admission, or dead on arrival

6. EDC estimated date of conception, or estimated date of confinement ("due date" means almost the opposite of "conception date")

7. HZO herpes zoster ophthalmicus, or herpes zoster oticus

8. IBD inflammatory bowel disease, or irritable bowel disease

9. LLL left lower lid, or left lower lip, or left lower lobe, or left lower lung

10. MCGN mesangiocapillary glomerulonephritis or minimal change glomerulonephritis

11. MVR mitral valve regurgitation, or mitral valve repair, or mitral valve replacement

12. NC no change, or noncontributory

13. NKDA no known drug allergies, or nonketotic diabetic acidosis

14. PE pulmonary effusion, or pulmonary edema, or pulmonary embolectomy or pulmonary embolism

15. SK seborrheic keratosis, or solar keratosis

16. UVF ureterovaginal fistula, or urethrovaginal fistula

- Jules J. Berman, Ph.D., M.D. tags: common disease, orphan disease, orphan drugs, rare disease, disease genetics, medical abbreviations, disambiguation, dangerous abbreviations, medical nomenclature, medical transcription, electronic health record, ehr, emr, medical errors, medical mistakes, medical terminology, ambiguous terminology, terminology pitfalls and confusing terminology

Friday, February 15, 2008

How are diseases named?

Is there a general rule for naming human diseases? No. Here is a list of some of the many ways by which diseases get their names.

1 As an expression of a characteristic pathologic process (e.g., muscular dystrophy)

2 For the physical agent that produced the disease (e.g., plumbism)

3 For a group of people who were at high risk for the disease (e.g., Legionnaires' Disease, named after a group of conventioneers who succumbed in an early outbreak)

4 For a molecule found in diseased cells (e.g. amyloidosis, prion disease)

5 For a geographic region in which occurrences of the disease are concentrated (e.g., Tangier disease, from Tangier Island, Virginia)

6 For the geographic spot from which a widespread epidemic emanated (e.g., Lyme disease, from Lyme, Connecticut)

7 For a striking clinical feature of the disease (e.g., sleeping sickness)

8 As a crude and insensitive comparison to a non-human object (e.g., gargoylism, ichthyosis with confetti, happy puppet syndrome)

9 As a literary metaphor (e.g., Pickwickian syndrome, Mad Hatter's disease, Alice in Wonderland syndrome, Job's syndrome)

10 For a striking morphologic feature (e.g., sickle cell anemia)

11 For a patient who had the disease (e.g., Lou Gehrig disease)

12 For a physician or scientist who treated, described or researched the disease (e.g., Hodgkin disease, Cushing disease, Kaposi sarcoma)

13 As a witty but unhelpful acronym (e.g., CATCH 22 = cardiac abnormality, abnormal facies, T-cell deficit due to thymic hypoplasia, cleft palate, hypocalcemia resulting from a deletion on chromosome 22)

14 As a trope or descriptive metaphor from any existing language (e.g., Moyamoya disease derives from "moyamoya", meaning "puff of smoke" in Japanese, for the characteristic tangle of tiny cerebral vessels seen on x-ray)

15 As a token of Greek or Latin scholarship (e.g., pityriasis lichenoides et varioliformis acuta)

16 As a somewhat obscure and trivial fact that would be understandable only to experts (e.g., one and a half syndrome, which refers to a specific neurologic condition in which one eye acquires movement deficits, while the other eye acquires half of those deficits)

17 As inscrutable combinations of one or more of the above (e.g., the wistful-sounding "floating-harbor syndrome," named by combining the hospital in which one of the first cases appeared, Boston Floating Hospital, with a second hospital in which another case appeared, Harbor General Hospital in Torrance, California)

This list was taken from my book, Biomedical Informatics (List 7.3.1).

- Jules J. Berman, Ph.D., M.D. tags: common disease, orphan disease, orphan drugs, genetics of disease, disease genetics, rules of disease biology, rare disease, medical nomenclature, names of diseases, disease terminology, pathology, logophile, medical metaphor, medical terminology, pathologic process, pathophysiology, anatomic pathology, naming diseases, names of diseases, literary medicine, history of medicine

Thursday, February 14, 2008

The importance of having a FAST medical autocoder

In the past few blogs, I've been writing about medical autocoders.

The medical informatics literature has lots of descriptions of medical autocoders, but most of these descriptions fail to include the speed of the autocoders.

It's been my experience that most published autocoders work at about 500 bytes per second. If a surgical pathology report is 1000 bytes (and I expect that this is roughly the length of a surgical pathology report), a report would take about 2 seconds to autocode.

The autocoder that I wrote about in the past few blogs works at about 100 kilobytes per second (i.e. 1 megabyte of text in ten seconds). For code simplicity, I didn't use the doublet method for this autocoder, and I think had I done so, it would have coded at about 1 Megabyte of text per second in Perl or Ruby (even faster in Python).
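
The arithmetic behind these figures is simply text size divided by throughput. As a small worked example in Python (the function name is hypothetical, and decimal units are assumed, so that 1 kilobyte is 1,000 bytes and 1 megabyte is 1,000,000 bytes):

```python
def seconds_to_autocode(text_bytes, bytes_per_second):
    """Time needed to autocode a text of the given size at the given speed."""
    return text_bytes / bytes_per_second


REPORT = 1_000        # one surgical pathology report, in bytes
MEGABYTE = 1_000_000  # decimal megabyte (an assumption of this sketch)

print(seconds_to_autocode(REPORT, 500))          # -> 2.0  (500 bytes per second)
print(seconds_to_autocode(MEGABYTE, 100_000))    # -> 10.0 (100 kilobytes per second)
print(seconds_to_autocode(MEGABYTE, 1_000_000))  # -> 1.0  (1 megabyte per second)
```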

Why is it important to have a fast autocoder? Why can't you load your parser with a big file and let it run in the background, taking as long as it takes to finish?

There are three reasons why you absolutely must have a fast autocoder. I discuss these in my book, Biomedical Informatics, and I thought I'd address the issue in this blog as well.

1. Medical files today are large. It is not unusual for a large medical center to generate a terabyte of data each week. A slow autocoder could never keep up with the volume of medical information that is produced each day.

2. Autocoders, and the nomenclatures they draw terms from, need to be modified to accommodate unexpected oddities in the text that they parse (particularly formatting oddities and the inclusion of idiosyncratic language to express medical terms). The cycle of running a program, reviewing output, making modifications in software or nomenclatures, and repeating the whole process many times cannot be undertaken if you need to wait a week for your autocoding software to parse your text.

3. Autocoding is as much about re-coding as it is about the initial process of providing nomenclature codes.

You need to re-code (supply a new set of nomenclature codes for terms in your medical text) whenever you want to change from one nomenclature to another.

You need to re-code whenever you introduce a new version of a nomenclature.

You need to re-code whenever you want to use a new coding algorithm (e.g., parsimonious coding versus comprehensive coding, or linking a code to a particular extracted portion of the report).

You need to re-code whenever you add legacy data to your laboratory information systems.

You need to re-code whenever you merge different medical datasets (especially medical datasets that have been coded with different medical nomenclatures).

All of this re-coding adds to the data burden placed on a medical autocoder.

It has been my personal observation that computational tasks that take much time (more than a few seconds) tend to be put on the back burner. Many of the same observations apply to medical deidentification software. Smart informaticians understand that program execution speed is always very important.

- Jules Berman

My book, Principles of Big Data: Preparing, Sharing, and Analyzing Complex Information was published in 2013 by Morgan Kaufmann.

I urge you to explore my book. Google books has prepared a generous preview of the book contents.

tags: big data, metadata, data preparation, data analytics, data repurposing, datamining, data mining, autocoding, data scrubbing, medical autocoding, medical nomenclature, medical software

Tuesday, February 12, 2008

Medical autocoding with Perl

In yesterday's blog, I showed a short, simple Ruby script that can provide quick and accurate medical autocoding for medical free-text. I also provided a web site where you could inspect 20,000 PubMed abstract titles and the extracted/coded terms produced by the Ruby autocoder.

Today, I'm providing a web site with the equivalent Perl medical autocoder, along with the public domain output file of 20,000 autocoded PubMed abstracts. Surprisingly (to me) the Perl code executed at about the same speed as the Ruby code. Both autocoders would have significant speed gains if they used the doublet method (which I didn't use here because I wanted to demonstrate the shortest possible scripts). The Perl code is contained on the web page.

- Jules Berman

Sunday, February 10, 2008

Update of Neoplasm Classification

An update for the Neoplasm Classification, an open access document distributed under the GNU Free Documentation License, is now available as a gzipped XML file:

NEOCLXML.GZ 719,099 bytes

The latest version of the Neoplasm Classification contains over 146,400 different terms, of which 130,482 are classified names of neoplasms listed under 5,855 concepts.

An explanation of the classification is found in the beginning of the file.

This is the world's largest and most comprehensive listing of neoplasm names and is intended for use in biomedical informatics research and cancer research.

-Jules Berman tags: medical nomenclature, terminology

Monday, January 14, 2008

Parsable Doublets List now available in public domain

Word doublets are two-word phrases that appear in text (i.e., they are not randomly chosen two-word sequences).

Doublets can be used in a variety of informatics projects: indexing, data scrubbing, nomenclature curation, etc. Over the next few days, I will provide examples of doublet-based informatics projects.

A list of over 200,000 word doublets is available for download.

The list was generated from a large narrative pathology text. Thus, the doublets included here would be particularly suitable for informatics projects involving surgical pathology reports, autopsy reports, pathology papers and books, and so on.

The Perl script that generated the list of doublets by parsing through a text file ("pathold.txt"), is shown:


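#!/usr/local/bin/perl
use strict;
use warnings;

# Read the input text and write the sorted list of unique doublets.
open(my $text, "<", "pathold.txt") || die "cannot open pathold.txt: $!";
open(my $out, ">", "doublets.txt") || die "cannot open doublets.txt: $!";
my $var;
  {
  local $/;                   # slurp the whole file into one string
  $var = <$text>;
  }
close $text;
$var =~ s/\n/ /g;             # join lines with spaces
$var =~ s/\'s//g;             # drop possessive endings
$var =~ tr/a-zA-Z\'\- //cd;   # delete everything but letters, ', - and space
my @words = split(/ +/, $var);
my %doublethash;
my $oldthing = "";
foreach my $thing (@words)
  {
  my $doublet = "$oldthing $thing";
  # keep only doublets made of two all-lowercase words
  if ($doublet =~ /^[a-z]+ [a-z]+$/)
    {
    $doublethash{$doublet} = "";
    }
  $oldthing = $thing;
  }
my @wordarray = sort(keys(%doublethash));
print $out join("\n", @wordarray);
close $out;
exit;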

You can generate your own list by substituting any text file you like for "pathold.txt". Keep in mind that the Perl script slurps the entire text file into a string variable, so the script won't work if you use a file that exceeds the memory of the computer. For most computers (with more than 256 Mbytes of RAM) this will not be a problem. On my computer (about 2.8 GHz and 512 Mbytes of RAM), the script takes about 5 seconds to parse a 9 Megabyte text file.
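
If memory is a concern, the same doublets can be collected one line at a time, carrying the last word of each line over to the next. The following is a sketch in Python rather than Perl, and it is not the script that produced the distributed list; the function name is hypothetical, and it is meant to apply the same clean-up rules as the Perl script (possessives dropped, only letters, apostrophes, hyphens and spaces kept, only all-lowercase doublets retained).

```python
import re

# Characters removed from the text: everything except letters, ', - and space.
NOT_ALLOWED = re.compile(r"[^a-zA-Z'\- ]")
# A doublet is kept only if both words are entirely lowercase letters.
DOUBLET = re.compile(r"[a-z]+ [a-z]+")


def doublets_from_lines(lines):
    """Return the sorted unique doublets found in an iterable of text lines."""
    found = set()
    previous = ""  # last word seen, carried across line boundaries
    for line in lines:
        cleaned = line.replace("\n", " ").replace("'s", "")
        cleaned = NOT_ALLOWED.sub("", cleaned)
        for word in cleaned.split():
            doublet = f"{previous} {word}"
            if DOUBLET.fullmatch(doublet):
                found.add(doublet)
            previous = word
    return sorted(found)


sample = ["The patient's lymph node\n", "was enlarged and firm.\n"]
print(doublets_from_lines(sample))
# -> ['and firm', 'enlarged and', 'lymph node', 'node was', 'patient lymph', 'was enlarged']
```

An open file can be passed directly, for example doublets_from_lines(open("pathold.txt")), so that only the current line and the set of doublets are held in memory.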

Since the doublet list below consists of a non-narrative collection of words, it cannot be copyrighted (i.e., it is distributed as a public domain file).

-Jules Berman

tags: big data, metadata, data preparation, data analytics, data repurposing, datamining, data mining, biomedical informatics, curation, data scrubbing, deidentification, medical nomenclature, Perl script, public domain, doublets list

Saturday, January 5, 2008

Zipf law for surgical pathology

In almost every segment of life, a small number of items usually account for the bulk of the observed activities. Though there are millions of authors, a relatively small number of authors account for the bulk of the books sold (think J.K. Rowling). A small number of diseases account for the bulk of deaths (think cardiovascular disease and cancer). A few phyla account for the bulk of the diversity of animals on earth (think arthropods). A few hundred words account for the bulk of all word occurrences in literature (think in, be, a, an, the, are). This phenomenon was observed and described by George Kingsley Zipf, who devised Zipf's law as a mathematical description. Wikipedia has an excellent discussion of Zipf's law.

Zipf's law applies to the diagnoses rendered in a pathology department. I helped write an early paper wherein three years' worth of surgical pathology reports, for a university-associated hospital, were collected and reviewed.

There were 64,921 diagnostic entries (averaging 1.6 SNOMED codes per specimen and 1.4 specimens per patient), that were accounted for by 1,998 different morphologic diagnoses. A mere 21 diagnostic entities accounted for 50% of the code occurrences. 265 entities accounted for 90% of the code occurrences, indicating that the diagnostic efforts of pathology departments are primarily devoted to a small fraction of the many thousands of described pathologic entities.
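
Figures of this kind come from ranking the diagnostic entities by frequency and counting how many of the most frequent ones are needed to reach a given share of all code occurrences. The following Python sketch shows that calculation on made-up counts; the function name and the toy data are assumptions for illustration, not the data from the paper.

```python
def entities_needed(counts, percent):
    """Smallest number of most-frequent entities that together account
    for at least `percent` percent of all occurrences."""
    total = sum(counts.values())
    running = 0
    ranked = sorted(counts.values(), reverse=True)
    for rank, count in enumerate(ranked, start=1):
        running += count
        # Integer comparison avoids floating-point rounding at the threshold.
        if running * 100 >= percent * total:
            return rank
    return len(counts)


# Toy counts (made up): occurrences of each diagnostic entity.
toy_counts = {
    "diagnosis A": 50,
    "diagnosis B": 30,
    "diagnosis C": 10,
    "diagnosis D": 5,
    "diagnosis E": 5,
}

print(entities_needed(toy_counts, 50))  # -> 1
print(entities_needed(toy_counts, 90))  # -> 3
```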

This paper, published in 1994, is available for review.

-Jules J. Berman

tags: common disease, orphan disease, orphan drugs, genetics of disease, disease genetics, rules of disease biology, rare disease, pathology, anatomic pathology, medical nomenclature