Specified Life: February 2008

Friday, February 29, 2008

The linguistic basis for the doublet medical autocoder and scrubber (Part 1)

In yesterday's blog, I presented a very fast (0.2 Megabytes per second) combined medical autocoder and medical scrubber. The software application is written in a few dozen lines of Perl. The output of the autocoder/de-identification software on a corpus of 95,260 PubMed Citations is available.

In the next few days, I'm going to try to explain the linguistic basis of how this software works.

First, you need to understand that medical terms occur primarily as phrases, not as single words.

The Neoplasm Classification contains over 135,000 names of neoplasms. Of these 135,000 terms, only about 500 are single-word terms.
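Five hundred out of 135,000 is well under one percent. The count is easy to reproduce for any term list. Here is a minimal sketch, assuming the terms have already been pulled out of the nomenclature file into a Python list; the five names below are stand-ins for illustration, and the function name is hypothetical.

```python
def split_by_word_count(terms):
    """Separate a term list into single-word and multi-word terms."""
    single = [t for t in terms if len(t.split()) == 1]
    multi = [t for t in terms if len(t.split()) > 1]
    return single, multi

# Stand-in sample (an assumption); the real list would come from the nomenclature file.
sample_terms = [
    "adenocarcinoma",
    "adenocarcinoma of prostate",
    "squamous cell carcinoma",
    "lipoma",
    "renal cell carcinoma",
]

single, multi = split_by_word_count(sample_terms)
print(len(single), "single-word terms;", len(multi), "multi-word terms")
```

A term is treated as single-word when it contains no whitespace, so hyphenated names would count as one word under this rule.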

Here is the list of about 500 single word terms in the Neoplasm Classification. The Neoplasm Classification is available in its entirety as a gzipped xml file.

List of single word terms from the Neoplasm Classification:

"acanthoma, achrochordon, acml, acrochordon, adamantinoma, adenoacanthoma, adenocarcinofibroma, adenocarcinoma, adenofibroma, adenolipoma, adenolymphoma, adenoma, adenomyoepithelioma, adenomyoma, adenopathy, adenosarcoma, aesthesioneuroblastoma, aesthesioneurocytoma, aesthesioneuroepithelioma, aild, allp, ameloblastoma, amml, androblastoma, angioblastoma, angioendothelioma, angiofibroma, angioglioma, angiokeratoma, angioleiomyoma, angiolipoma, angioma, angiomyofibroblastoma, angiomyolipoma, angiomyoliposarcoma, angiomyoma, angiomyosarcoma, angiomyxolipoma, angiomyxoma, angiosarcoma, anll, apl, apml, apudoma, argentaffinoma, arrhenoblastoma, asthesioneuroblastoma, asthesioneurocytoma, astroblastoma, astrocytoma, astroglioma, asymptomatic, atll, atypia, baltoma, basalioma, basiloma, blastoma, bnct, branchioma, calc, calcification, cancer, carcinofibroma, carcinoid, carcinoma, carcinosarcoma, cavernoma, cementoblastoma, cementoma, ceruminoma, chemodectoma, chloroleukaemia, chloroleukemia, chloroma, chloromyeloma, chlorosarcoma, cholangioadenoma, cholangiocarcinoma, cholangiohepatoma, cholangioma, choledochocele, choledochocoele, choledochocyst, cholesteatoma, chondroblastoma, chondroma, chondrosarcoma, chondrosteoma, chorangioma, chordocarcinoma, chordoepithelioma, chordoma, chorioadenoma, chorioangioma, choriocarcinoma, chorioepithelioma, chorionepithelioma, chromaffinoma, cml, collagenoma, comedocarcinoma, corticotropinoma, cpnet, craniopharyngioma, cylindroma, cyst, cystadenocarcinoma, cystadenofibroma, cystadenoma, cystoma, dentinoma, deposit, dermatofibroma, dermatofibrosarcoma, dermatomyofibroma, dermoid, desmoid, diktyoma, dipnech, dnet, dyscrasia, dysembryoma, dysgerminoma, dysplasia, ecchondroma, ectomesenchymoma, elastofibrolipoma, elastofibroma, elst, embryoma, enchondroma, endometrioma, endometrium, endotheliosarcoma, enteroglucagonoma, ependymoblastoma, ependymoma, epithelioma, erythraemia, erythremia, erythrocythaemia, erythrocythemia, erythrocytophagy, 
erythrodysplasia, erythroleukaemia, erythroleukemia, erythrophagia, erythroplakia, esthesioneuroblastoma, esthesioneurocytoma, esthesioneuroepithelioma, fibroadenoma, fibroepithelioma, fibrofolliculoma, fibroid, fibroleiomyoma, fibrolipoma, fibroliposarcoma, fibroma, fibromyoma, fibromyxolipoma, fibromyxoma, fibromyxosarcoma, fibroodontoma, fibrosarcoma, fibrothecoma, fibroxanthogranuloma, fibroxanthoma, fibroxanthosarcoma, gammopathy, gammapathy, gangliocytoma, ganglioglioma, ganglion, ganglioneuroblastoma, ganglioneurofibroma, ganglioneuroma, gangliorhabdomyosarcoma, gant, gastrinoma, gemistocytoma, germinoblastoma, germinoma, gist, glioblastoma, gliofibroma, glioma, glioneuroma, gliosarcoma, glomangioma, glomangiomyoma, glomangiopericytoma, glomangiosarcoma, glucagonoma, gonadoblastoma, gonadotrophinoma, gonocytoma, granulocytopenia, gtni, gynandroblastoma, haemangioblastoma, haemangioendothelioma, haemangioma, haemangiopericytoma, haemangiosarcoma, haemolymphangioma, hamartoma, hemangioblastoma, hemangioendothelioma, hemangioma, hemangiopericytoma, hemangiosarcoma, hemolymphangioma, hepatoblastoma, hepatocarcinoma, hepatocholangiocarcinoma, hepatoma, hgsil, hibernoma, hidradenocarcinoma, hidradenoma, hidroacanthoma, hidrocystoma, histiocytoma, hlrcc, hsil, hygroma, hypernephroma, hyperplasia, hyperplastic, hyperprolactinaemia, hyperprolactinemia, hypersensitivity, hyperthyroidism, idl, igcnu, immunoblastoma, immunocytoma, immunodeficiency, infertility, insulinoma, insuloma, ipmt, itgcn, jgct, keloid, keratoacanthoma, keratoameloblastoma, leiomyoblastoma, leiomyofibroma, leiomyoma, leiomyomata, leiomyosarcoma, lentigo, leptomeningioma, leucaemia, leucoplakia, leucorrhoea, leukaemia, leukemia, leukoplakia, leukorrhagia, leukorrhea, leukorrhoea, lipoadenoma, lipoblastoma, lipogranuloma, lipoleiomyoma, lipoma, lipomata, liposarcoma, lphd, luteinoma, luteoma, lymphadenoma, lymphadenopathy, lymphangioendothelioma, lymphangioleiomyoma, lymphangioma, lymphangiomyoma, 
lymphangiosarcoma, lymphepithelioma, lymphoblastic, lymphoblastoma, lymphocytoma, lymphoepithelioma, lymphoma, lymphosarcoma, macroglobulinaemia, macroglobulinemia, macroprolactinoma, maffucci, malignancy, malnutrition, malt, maltoma, mantleoma, masculinovoblastoma, mastocytoma, medulloblastoma, medullocytoma, medulloepithelioma, medullomyoblastoma, melanoacanthoma, melanoameloblastoma, melanocarcinoma, melanocytoma, melanoma, melanomeloblastoma, melanosarcoma, meningioma, mesenchymoma, mesonephroma, mesothelioma, metaplasia, mgct, micca, microglioma, microinvasive, microprolactinoma, milia, mmgct, mnti, mole, mucocele, myeloblastoma, myelodysplasia, myelolipoma, myeloliposarcoma, myeloma, myelosarcoma, myelosuppression, myoblastoma, myoepithelioma, myofibroblastoma, myofibroma, myofibrosarcoma, myolipoma, myoma, myopericytoma, myosarcoma, myxofibroma, myxofibrosarcoma, myxolipoma, myxoliposarcoma, myxoma, myxosarcoma, naevi, naevoxanthoendothelioma, ncmh, neoplasia, neoplasm, nephroblastoma, nephroma, nephropathy, nesidioblastoma, neurilemmoma, neurilemmosarcoma, neurilemoma, neurinoma, neuroastrocytoma, neuroblastoma, neurocytoma, neuroepithelioma, neurofibroma, neurofibrosarcoma, neurolipocytoma, neuroma, neuronaevi, neuronevi, neurosarcoma, neurotensinoma, neurotheceoma, neurothecoma, neurothekeoma, neutropenia, nevi, nevocarcinoma, nevoxanthoendothelioma, nlphd, odontoameloblastoma, odontoma, oligoastrocytoma, oligodendroblastoma, oligodendroglioma, ollier, oncocytoma, orchioblastoma, oscst, osteoblastoma, osteochondroma, osteochondromyxoma, osteochondroplastica, osteochondrosarcoma, osteoclastoma, osteodermia, osteofibrosarcoma, osteoma, osteosarcoma, pachydermatocele, pancreatoblastoma, pancytopenia, panin, papilloma, parachordoma, paraganglioma, pecoma, pericytoma, perifolliculoma, perineurioma, perithelioma, phaeochromoblastoma, phaeochromocytoma, pheochromoblastoma, pheochromocytoma, piloleiomyoma, pilomatricoma, pilomatrixoma, pinealblastoma, 
pinealoblastoma, pinealocytoma, pinealoma, pineoblastoma, pineocytoma, pituicytoma, plasmacytoma, pnet, pneumoblastoma, pneumocytoma, polycythaemia, polycythemia, polyembryoma, polyp, porocarcinoma, poroma, ppnet, precancer, preleukaemia, preleukemia, prelymphoma, premalignancy, preneoplasia, presarcoma, prolactinoma, psammoma, pseudolymphoma, pseudoneuroma, pstt, ptgc, ptyalocele, raeb, raem, ranula, reah, reninoma, reticulohistiocytoma, reticulolymphosarcoma, reticulosarcoma, retinoblastoma, retinocytoma, retinoma, rhabdomyoma, rhabdomyosarcoma, sarcoma, schimmelbusch, schwannoma, sebaceoma, sega, seminoma, settle, sgat, siadh, sialoblastoma, sialocele, sinonasalhaemangiopericytoma, sinonasalhemangiopericytoma, somatostatinoma, somatotropinoma, spermatocytoma, spiradenocarcinoma, spiradenoma, splenomegaly, spongioblastoma, spongioneuroblastoma, sterile, sterility, stromomyoma, sturge, subependymoma, sympathicoblastoma, sympathicogonioma, sympathogonioma, symptom, syncytioma, synovioma, syringadenoma, syringoadenoma, syringocystadenoma, syringocystoma, syringofibroadenoma, syringoma, teratocarcinoma, teratocarcinosarcoma, teratoma, thecoma, thrombocythaemia, thrombocythemia, thymolipoma, thymoma, trauma, trichilemmocarcinoma, trichilemmoma, trichoadenoma, trichoblastoma, trichodiscoma, trichoepithelioma, trichofolliculoma, trichogerminoma, tricholemmoma, tumor, tumour, undifferentiated, vain, verruca, verrucae, vipoma, wegener, werner, xanthelasma, xanthofibroma, xanthogranuloma, xanthoma, xanthosarcoma"

A terminology that consists almost exclusively of multi-word phrases has consequences for the design of software that extracts medical terms from large text files (to be continued).
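One consequence can be sketched right away: a term extractor has to test runs of adjacent words, not words taken one at a time. The following is a minimal illustration of longest-match phrase lookup. It is not the doublet method itself, and the term set, function name, and four-word limit are all assumptions made for the example.

```python
import re

def extract_terms(text, term_set, max_words=4):
    """Return nomenclature terms found in text, preferring the longest match."""
    words = re.findall(r"[a-z]+", text.lower())
    found = []
    i = 0
    while i < len(words):
        match_len = 0
        # Try the longest candidate phrase first, then shorter ones.
        for n in range(min(max_words, len(words) - i), 0, -1):
            if " ".join(words[i:i + n]) in term_set:
                match_len = n
                break
        if match_len:
            found.append(" ".join(words[i:i + match_len]))
            i += match_len  # skip past the matched phrase
        else:
            i += 1
    return found

# Hypothetical three-term nomenclature, for illustration only.
terms = {"carcinoma", "renal cell carcinoma", "adenoma"}
print(extract_terms("Renal cell carcinoma was found next to an adenoma.", terms))
```

Because the longest phrase is tried first, "renal cell carcinoma" is reported as one term and the single word "carcinoma" inside it is not reported a second time.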

- Jules Berman

key words: deidentification, medical nomenclature, term extraction, scrubber, scrubbing

Science is not a collection of facts. Science is what facts teach us; what we can learn about our universe, and ourselves, by deductive thinking. From observations of the night sky, made without the aid of telescopes, we can deduce that the universe is expanding, that the universe is not infinitely old, and why black holes exist. Without resorting to experimentation or mathematical analysis, we can deduce that gravity is a curvature in space-time, that the particles that compose light have no mass, that there is a theoretical limit to the number of different elements in the universe, and that the earth is billions of years old. Likewise, simple observations on animals tell us much about the migration of continents, the evolutionary relationships among classes of animals, why the nuclei of cells contain our genetic material, why certain animals are long-lived, why the gestation period of humans is 9 months, and why some diseases are rare and other diseases are common. In “Armchair Science”, the reader is confronted with 129 scientific mysteries, in cosmology, particle physics, chemistry, biology, and medicine. Beginning with simple observations, step-by-step analyses guide the reader toward solutions that are sometimes startling, and always entertaining. “Armchair Science” is written for general readers who are curious about science, and who want to sharpen their deductive skills.

Thursday, February 28, 2008

A fast (combined) medical autocoder and scrubber

In today's blog, I discuss a newly loaded public domain file that contains the combined autocoded and scrubbed output for 95,260 PubMed Citations (computed in under a minute).

In the field of biomedical informatics, the term "scrubbing" refers to removing patient identifiers from confidential medical records. The term "autocoding" refers to extracting medical terms from text and assigning each term its concept code from a nomenclature.

I have prepared a public domain corpus of 95,260 PubMed citations that have been autocoded using the Neoplasm Classification. The Neoplasm Classification is available as a gzipped xml file. All of the named neoplasms and all of the general non-specific terms for neoplasms (such as the word, "tumor") have been automatically extracted from the text.

In addition, all of the citations have been de-identified. Words that might be identifiers are replaced by an asterisk.
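To show what asterisk replacement looks like, here is a toy scrubber. It assumes a simple rule (keep any word found in a list of safe words, mask everything else), which is a simplification chosen for illustration and should not be read as the logic of the posted software. The safe-word list and function name are hypothetical.

```python
import re

def scrub(text, safe_words):
    """Replace every word that is not in safe_words with an asterisk."""
    def keep_or_mask(match):
        word = match.group(0)
        return word if word.lower() in safe_words else "*"
    return re.sub(r"[A-Za-z]+", keep_or_mask, text)

# Hypothetical safe-word list; a real one would be far larger.
safe_words = {"the", "patient", "has", "a", "lipoma", "of", "arm"}
print(scrub("The patient, John Smith, has a lipoma of the arm.", safe_words))
```

Punctuation and spacing pass through untouched, so the scrubbed sentence keeps its shape while the two name words become asterisks.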

On a web site, I have listed the first thousand entries in the file, just so that you get an idea of what a sample output might look like.

If you are curious about autocoding or medical record scrubbing (also called de-identification), you should visit two of my other web sites, which discuss these two topics in greater detail.

Autocoding (topic)

and

Medical Data Scrubbing (topic)

The automatic coder and scrubber consists of a few dozen lines of Perl code. The file that is coded and scrubbed contains 95,260 PubMed Citations and has a length of over 10 Megabytes. Autocoding and scrubbing took under a minute on a modest 2.8 GHz desktop computer with 512 Mbytes of RAM. This is a rate of about 200 Kilobytes per second.

The entire input file and the entire output file are available as gzipped text files, both available from my website:

Input text file (10 Megabytes expanded)

and

Output autocoded and deidentified file (25+ Megabytes expanded)

They are public domain documents.

You can check for yourself the accuracy of the scrubber and autocoder. You will find that virtually no names of neoplasms were missed and that virtually no identifiers were left in the scrubbed text.

Medical autocoding and medical record scrubbing are described in great detail in my two recently published books:

Perl Programming for Medicine and Biology

and

Ruby Programming for Medicine and Biology

-Jules J. Berman

My book, Principles of Big Data: Preparing, Sharing, and Analyzing Complex Information was published in 2013 by Morgan Kaufmann.

I urge you to read more about my book. Google Books has prepared a generous preview of the book contents. If you like the book, please ask your librarian to purchase a copy for your library or reading room.

tags: big data, metadata, data preparation, data analytics, data repurposing, datamining, data mining, medical autocoding, medical data scrubbing, medical data scrubber, medical record scrubbing, medical record scrubber, medical text parsing, medical autocoder, nomenclature, terminology, deidentification, deidentified, de-identification, de-identified, nomenclature, CUI, unique concept identifier

Monday, February 25, 2008

Ambiguous disease names

One of the problems in medical autocoding is the existence of polysemous disease names (disease names that correspond to different diseases depending on the context in which they appear).

Here are some examples:

Cervical carcinoma. Is this a carcinoma involving the neck (the structure between your head and your shoulder), or does this refer to a tumor of the uterine cervix?

Medullary carcinoma. This term can refer to medullary carcinoma of breast, medullary carcinoma of the thyroid gland, or medullary carcinoma of the adrenal medulla. These are all distinct tumors.

Paget's disease. This term can refer to a non-neoplastic disease of bones or a neoplastic process that most often involves the nipple overlying a breast cancer.

Because a disease name can refer to different biological diseases, the task of automatically mapping a disease name to a single disease concept is unlikely to be something that can be done with 100% accuracy.
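One practical response is to have the autocoder report every candidate concept and flag the term as ambiguous, leaving the final choice to a later step or a human reviewer. Here is a minimal sketch of that idea; the lookup table is hypothetical and uses plain descriptions where a real nomenclature would supply concept codes.

```python
# Hypothetical lookup table: one surface term, several candidate concepts.
CANDIDATES = {
    "cervical carcinoma": [
        "carcinoma of the neck",
        "carcinoma of the uterine cervix",
    ],
    "medullary carcinoma": [
        "medullary carcinoma of breast",
        "medullary carcinoma of thyroid",
        "medullary carcinoma of adrenal medulla",
    ],
    "paget's disease": [
        "Paget's disease of bone",
        "Paget's disease of nipple",
    ],
    "retinoblastoma": ["retinoblastoma"],
}

def code_term(term):
    """Return (concepts, is_ambiguous); unknown terms give ([], False)."""
    concepts = CANDIDATES.get(term.lower(), [])
    return concepts, len(concepts) > 1

for name in ("Cervical carcinoma", "retinoblastoma"):
    concepts, ambiguous = code_term(name)
    print(name, "->", concepts, "(ambiguous)" if ambiguous else "")
```

The flag does not resolve anything by itself; it only makes the uncertainty visible in the output instead of silently picking one meaning.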

The oddities of medical nomenclatures, and the impact on data mining, are discussed at length in my book, Biomedical Informatics.

In June, 2014, my book, entitled Rare Diseases and Orphan Drugs: Keys to Understanding and Treating the Common Diseases was published by Elsevier. The book builds the argument that our best chance of curing the common diseases will come from studying and curing the rare diseases.

I urge you to read more about my book. There's a generous preview of the book at the Google Books site. If you like the book, please ask your librarian to purchase a copy for your library or reading room.

- Jules J. Berman, Ph.D., M.D.
tags: common disease, orphan disease, orphan drugs, rare disease, subsets of disease, disease genetics, genetics of complex disease, genetics of common diseases, medical nomenclatures, homonyms, medical autocoding, text retrieval, data mining

Sunday, February 24, 2008

Medical autocoder circa 1994

I just added another publication to my web site. This 1994 paper, published in the American Journal of Clinical Pathology as a government work, provides some early data on the accuracy of medical autocoders.

The abstract:

Many pathology departments rely on the accuracy of computer-generated diagnostic coding for surgical specimens. At present, there are no published guidelines for assuring the quality of coding devices. To assess the performance of SNOMED coding software, manual coding was compared with automated coding in 9,353 consecutive surgical pathology reports at the Baltimore VA Medical Center. Manual SNOMED coding produced 13,454 diagnostic entries comprising 519 distinct diagnostic entities; 209 were unique diagnoses (assigned to only one of the 9,353 reports). Automated coding obtained 23,744 diagnostic entries comprising 498 distinct diagnostic entities, of which 129 were unique diagnoses. There were only 44 instances (0.5%) where automated coding missed key diagnoses on surgical case reports. In summary, automated coding compared favorably with manual coding. To achieve the maximum performance from software coding applications, departments should monitor the output from automatic coders. Modifications in reporting style, code dictionaries, and coding algorithms can lead to improved coding performance.

The last line of the abstract, "Modifications in reporting style, code dictionaries, and coding algorithms can lead to improved coding performance," has been my mantra for the past 14 years.
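The figures quoted in the abstract can be checked with a few lines of arithmetic. The numbers below are taken directly from the abstract; the variable names are just labels chosen for this sketch.

```python
# Counts as reported in the abstract.
reports = 9353
manual_entries = 13454
automated_entries = 23744
missed = 44

miss_rate = 100 * missed / reports            # percent of reports with a missed key diagnosis
manual_per_report = manual_entries / reports
automated_per_report = automated_entries / reports

print(f"missed key diagnoses: {miss_rate:.1f}% of reports")
print(f"manual entries per report: {manual_per_report:.2f}")
print(f"automated entries per report: {automated_per_report:.2f}")
```

The miss rate rounds to the 0.5% stated in the abstract, and the automated coder averaged about 2.5 diagnostic entries per report against about 1.4 for manual coding.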

- Jules Berman

key words: biomedical informatics, medical informatics, medical record retrieval, medical record indexing, biomedical autocoding, biomedical autocoder, surgical pathology reports

Saturday, February 23, 2008

Confused medical terms

There are many medical terms that have nearly the same orthography, are often pronounced identically, and have completely different meanings. These words are not picked up by spell checkers (because they are not misspelled), and occasionally appear as erroneous text within medical records.
Examples are:

acinic, actinic
anisakiasis, anisokaryosis
Apert syndrome, Alport syndrome (Apert syndrome 
    is a rare disorder characterized by early 
    fusion of skull bones. Alport syndrome is a 
    rare disorder characterized by kidney disease,
    hearing loss, and eye abnormalities.)
aptotic, apoptotic
arboreal, aboriginal
arteritis, arthritis
aural, oral
auxiliary, axillary
brachial, brachium, branchial
callous, callus
Carney triad, Carney complex (Carney triad
    is gastric leiomyosarcoma, pulmonary chondroma,
    and extraadrenal paraganglioma, occurring 
    mainly in young women. Carney complex is
    myxoma, spotty pigmentation, and
    endocrinopathy. Both Carney triad and Carney
    complex are technically Carney syndromes.)
causality, casualty
chlorpropamide, chlorpromazine
chondroid, chordoid
chondroma, chordoma
chorionic, chronic
cingula, singular
coitus, colitis
colic, colonic
colitis, coitus
costal, coastal
cryptogam, cryptogram
cygnet, signet
decease, disease
deceased, desist
digitalize, digitize
digitate, digitize
dioecious, deciduous
diploic, diploid
disc, disk
disease, decease
diseased, deceased
disseminated sclerosis, systemic sclerosis (the first is 
        multiple sclerosis, and the second is scleroderma)
dyskaryosis, dyskeratosis
dysphasia, dysphagia
E coli (the Amoebozoa), E coli (the Enterobacteriaceae)
ectatic, ecstatic
endochondral, enchondral (these are synonyms)
engram, n-gram, ngram
epistasis, epistaxis, epitaxis 
        (the last is a misspelling of the second)
exxon, exon
facial, fascial
facies, faeces
falx, false
fetal, fatal
fibrinous, fibrous
fibrosis, fibrositis
firearm, forearm
foreword, forward
fossa, phossy
Gnathostoma, Gnathostomata (Gnathostoma genus of helminths, 
   Gnathostomata class of jawed vertebrates)
hallux, helicis
helicis, hallux
herpetic, herpangina
hydatid, hydatidiform
hypochondrium, hypochondria
ileitis, iliitis
ileum, ilium
insular, insulin
intercostal, intercoastal
intubation, incubation
isotope, isotrope
kerasin, kerosene, keratin
keratotic, keratinic, ketotic
keratinocytic, keratinolytic
keratosis, ketosis
lipoma, lymphoma
lumbar, lumber
malleolus, malleus
meniere disease, menetrier disease (former is an inner ear disorder, 
      latter is a hypertrophic gastropathy)
metachronous, metacrinus
milia, milium
miotic, mitotic, meiotic
mitogenic, mitogenomic
mitosis, meiosis, myosis, myiasis
monogenic, monogenetic, and Monogenetic (last, 
      related to class Monogenea)
mucous, mucus
myelofibrosis, myofibrosis
myelogenous, myelopathy (the former refers 
      to blood forming cells, the latter to spinal cord disease)
myelopathy, myotilinopathy (the former is 
      spinal cord disease, the latter a type of myofibrillar myopathy)
myofibroma, myelofibroma
neuroplastic, neoplastic
nucleus, nucleolus
oncocyte, onychocyte
oncology, ontology, ontogeny
organic, organoid
ornithine, ornithurine
otic, optic (otic ear, optic eye)
palatal, palatial
paleodontology, paleontology
palette, palate
palpation, palpitation
panacea, placebo (one cures all, the other cures none)
parasite, pericyte
parental, parenteral
pathogen, parthenogen
pathogenesis, parthenogenesis
pathogenic, pathogenetic (these two are synonyms)
pediculated, pedunculated
penal, penile, pineal, panel
penicillamine, penicillin
perineal, peroneal, perianal
phyllodes, phylloides
pigmentosa, pigmentosum (retinitis pigmentosa and xeroderma pigmentosum)
pleiotropic, pleiotrophic, pleiotypic (the first two are synonyms)
plural, pleural
polypoid, polyploid
porphyria, porphyruria
proptosis, ptosis
prostrate, prostate
protuberant, protruberant (the second term is simply a common misspelling)
pyelonephritis, pyonephritis
quinine, quinidine, quinone
rachischisis, rachitis, rachischitic, rachitic
radial, radical
relics, relicts
reticle, reticule, radical
rett syndrome, RET gene
rett syndrome, Tourette syndrome
rosacea, rosea
semantic, somatic
serous, serious
silicon, silicone
singleton, singultus
sinusitis, synositis
somatic, semantic, semitic
sonography, somnography, stenography
taenia, tinea
takoma, trachoma
thecoma, thekeoma
Tietze syndrome, Tietz syndrome (Tietze syndrome 
   is chondropathia tuberosa or costochondral junction 
   syndrome; Tietz syndrome is albinism-deafness
   syndrome, an autosomal dominant congenital disorder)  
torsion, distortion
trachoma, trachea
trichina, trachoma, trichura
trichinosis, trichosis, trichuriasis
trichrome, trichome
trochlear, tracheal
troglobite, troglodyte, trilobite
tuberous sclerosis, tuberculosis
tunicate, tourniquet, turbinate
typhoid, typhus
urethral, ureteral
vagitis, vaginitis
venous, venus
viscous, viscus

Medical transcriptionists and other healthcare professionals should be aware of the correct meaning of each alternate word in these listed pairs and groups.
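Since a spell checker will not catch these words, one option is a separate pass that flags any word belonging to a known confusable group and shows its alternatives to a human reviewer. Here is a minimal sketch using four groups from the list above; the function and variable names are hypothetical, and the script only flags words, it cannot tell which member of a group was intended.

```python
import re

# A few groups taken from the list above; a real table would hold them all.
CONFUSABLE_GROUPS = [
    {"ileum", "ilium"},
    {"perineal", "peroneal", "perianal"},
    {"prostrate", "prostate"},
    {"dysphasia", "dysphagia"},
]

# Map each word to the other members of its group.
ALTERNATIVES = {
    word: sorted(group - {word})
    for group in CONFUSABLE_GROUPS
    for word in group
}

def flag_confusables(text):
    """Return (word, alternatives) for each confusable word, in text order."""
    words = re.findall(r"[a-z]+", text.lower())
    return [(w, ALTERNATIVES[w]) for w in words if w in ALTERNATIVES]

print(flag_confusables("Biopsy of the terminal ilium; patient reports dysphasia."))
```

In the sample sentence both flagged words are plausible errors (terminal ileum, dysphagia), which is exactly the kind of case a reviewer would want brought to attention.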

- Jules J. Berman, Ph.D., M.D.
tags: common disease, orphan disease, orphan drugs, rare disease, medical terminology, medical errors, malaprop, malapropism, definition, confusing terms, confused medical terms, confusing medical terms, medical definitions, medical transcription, nomenclature, rare diseases, terminology, common mistakes, common errors, typographical errors, sources of confusion, sources of error, medical dictionary, tricky medical terms, medical informatics

Friday, February 22, 2008

Ruby, Perl, and Python medical autocoding

The other day, I created a very large web page that included 20,000 PubMed abstracts and the autocoded output for each.

The web page was apparently too large for some people to view, so I cut it down to show about 10,000 autocoded samples, along with the code for the autocoder in Ruby, Perl and Python. It is all available at:

http://www.julesberman.info/rubycode.htm

For anyone unfamiliar with medical autocoding, a medical autocoder is a software program capable of parsing large collections of medical records (e.g. radiology reports, surgical pathology reports, autopsy reports, admission notes, discharge notes, operating room notes, medical administrative emails, memoranda, manuscripts, etc.) and capturing the medical concepts contained in the text.

The term "autocoding" should be distinguished from "computer-assisted manual coding." Health care workers may use a software enhancement of their Hospital Information Systems to code a section of text as they enter reports into the computer system. Typically, candidate terms and term codes [from a medical nomenclature] are displayed on the same screen as the entered report. The person entering text is often given the option of editing the proffered codes. This process should not be confused with "autocoding" and is not equivalent to the fully automatic and large-scale coding required by biomedical informaticians.

Finding all the concepts in a corpus of text is a necessary and early step in all data mining efforts. The autocoded terms can be used individually as index terms for the document, on a record-by-record basis to produce a concept "signature" that is highly specific for each report, or collectively to relate the frequency of terms within records with the frequency of terms in the aggregate document.
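The three uses can be sketched in a few lines. The example assumes the autocoder has already produced a list of concept codes for each record; the record identifiers and codes below are made up for illustration.

```python
from collections import Counter, defaultdict

# Hypothetical autocoder output: record id -> concept codes found in that record.
coded_records = {
    "rec1": ["C001", "C002", "C001"],
    "rec2": ["C002"],
    "rec3": ["C003", "C001"],
}

# 1. Index: which records mention each concept.
index = defaultdict(list)
for rec_id, codes in coded_records.items():
    for code in sorted(set(codes)):
        index[code].append(rec_id)

# 2. Signature: the sorted set of concepts found in one record.
signatures = {
    rec_id: tuple(sorted(set(codes)))
    for rec_id, codes in coded_records.items()
}

# 3. Frequencies: counts within each record and across the whole collection.
per_record = {rec_id: Counter(codes) for rec_id, codes in coded_records.items()}
aggregate = Counter(code for codes in coded_records.values() for code in codes)

print(dict(index))
print(signatures)
print(aggregate.most_common())
```

Comparing a record's counts in `per_record` with the totals in `aggregate` is the starting point for finding concepts that are unusually frequent in one report.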

The simple autocoder provided (in Perl, Python, and Ruby programming languages) is fast (about 100 kilobytes of text per second) and nearly perfect. You can check the output yourself for accuracy (in neoplasm terms extracted and coded). A minor modification of the scripts will accommodate any nomenclature for which terms are assigned concept-code numbers.

- Jules Berman

key words: medical software, nomenclature, medical datamining, perl programming, ruby programming, python programming, biomedical informatics, medical informatics, autocoding, autocoder, medical autocoding

Thursday, February 21, 2008

Tools to battle the complexity of biomedical software and medical information systems

Those who regularly read this blog know that one of my pet peeves is the increasing complexity of biomedical software. My belief is that complex systems are chaotic and unpredictable, and the best way to deal with software complexity is to eliminate it.

Here is a list of the basic intellectual tools that I believe can help reduce complexity.

1 Classifications. A class inherits properties in a direct lineage from a parent class. An object can only occupy a single class. Classifications are easy to understand and compute. This is the definition of classification that is used by biologists (as in the classification of all living organisms) and applies well to computer science. Classifications are related to (but different from) ontologies. Ontologies, unlike classifications, can become highly complex. Classifications always reduce the complexity of a knowledge domain.

2 Flat data files that can be extended but not re-written. A telephone book is a close example. If people never changed their names, never died, and never changed their telephone numbers, a telephone directory would be an ideal example. Data that can be sensibly organized in this kind of flat file is very simple to work with.

3 The EMR (electronic medical record). The EMR is the digital equivalent of the patient chart. In this model, all new clinical reports pertaining to a patient are inserted into the EMR object for the patient. This is a simple data model that can work well so long as one and only one record is created for each patient.

4 Small, self-contained specialized information systems. These applications are designed for a specific and narrow function (e.g. cytopathology information system). Complexity does not intervene until the specialized information system needs to interact with other systems in the hospital.

5 Fundamental algorithms. Almost all important algorithms are simple and can be explained in a few steps. From these simple algorithms, complex systems can arise.

6 Simple protocols. Very simple protocols can support incredibly complex systems. TCP/IP (the internet protocol) is a simple strategy for transferring packets of information over a network of computers.

7 Elegant object-oriented programming languages, such as Ruby. Though Ruby is a simple and elegant language, it can be used to create hopelessly complex software. Programmers need extensive training in design principles that minimize complexity.

8 Specifications. Specifications are formal ways of explaining what you've done so that computers and humans can understand and replicate your work. It is important to have a standard syntax for describing data and for organizing information into meaningful statements that can be interpreted by software agents (RDF is a fine example). I distinguish specifications from standards. Informatics standards impose an idiosyncratic, specialized format on data and tend to increase the complexity of information across different data domains.

9 Unique data identifiers. Computers are good at creating and tracking unique identifiers.

10 Encryption algorithms. It is easy to make something a secret.

11 De-identified public datasets. Publicly released de-identified data simplifies research by permitting multiple projects on the same set of data. With remarkably few exceptions (zero, in my opinion), de-identified public medical datasets have not hurt patients.
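The first item in the list, a classification in which every class has exactly one parent, shows how little code a simple structure needs. Here is a minimal sketch; the class names are illustrative and not drawn from any particular classification.

```python
# Each class names exactly one parent; the root has None.
# Class names are illustrative assumptions.
PARENT = {
    "neoplasm": None,
    "epithelial neoplasm": "neoplasm",
    "carcinoma": "epithelial neoplasm",
    "adenocarcinoma": "carcinoma",
}

def lineage(class_name):
    """Walk from a class up to the root, returning the chain of ancestors."""
    chain = []
    current = class_name
    while current is not None:
        chain.append(current)
        current = PARENT[current]
    return chain

def is_a(class_name, ancestor):
    """True if ancestor appears in the lineage of class_name."""
    return ancestor in lineage(class_name)

print(lineage("adenocarcinoma"))
print(is_a("adenocarcinoma", "neoplasm"))
```

Because each class has a single parent, the lineage is one unbranched chain, and any property attached to an ancestor can be found by walking that chain.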

Most programmers would include UML (Unified Modeling Language) in this list. I left it out because UML seems very complex to me and it permits programmers to manage complexity (rather than reduce or eliminate complexity). I confess that I do not know much about UML, but this is my current perception.

Medical software complexity is a topic that I discuss at great length in my recently published book, Biomedical Informatics.

- Jules Berman

key words: medical informatics, informatics complexity, classification, ontologies, ontology, hospital information systems, laboratory information systems

Wednesday, February 20, 2008

JCAHO policy on abbreviations

Effective January 1, 2004, hospitals accredited by the Joint Commission on Accreditation of Healthcare Organizations (JCAHO) were required to exclude certain types of abbreviations from hand-written medical records.

As yet (to the best of my knowledge), there is no equivalent ruling for the realm of electronic medical records. Electronic records provide enormous opportunity for the creation and propagation of miscommunications that can lead to medical errors.

Trailing and leading zeros, micrograms (mcg, not µg) and units (units, not U) are issues that can be easily solved in an electronic record. Abbreviations with alternate expansions (discussed in a prior blog) are due for a remedy.
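As a rough illustration of how little it takes, here is a sketch that rewrites the four notations mentioned above with regular expressions. The patterns are assumptions covering only the simple cases shown in the sample sentence; a production rule set would need testing against real records.

```python
import re

def normalize_dose(text):
    """Rewrite risky dose notation in a safer form (simple cases only)."""
    text = re.sub(r"(?<![\d.])\.(\d)", r"0.\1", text)   # add leading zero: .5 becomes 0.5
    text = re.sub(r"(\d+)\.0+(?!\d)", r"\1", text)      # drop trailing zero: 1.0 becomes 1
    text = re.sub(r"(\d)\s*[µμ]g\b", r"\1 mcg", text)   # micro sign or Greek mu, then g
    text = re.sub(r"(\d)\s*U\b", r"\1 units", text)     # bare U after a number
    return text

sample = "Give .5 mg now, then 1.0 mg; insulin 10 U; levothyroxine 50 µg."
print(normalize_dose(sample))
```

The rules are applied to text that is already typed, which is the advantage of the electronic record: the correction can be made at entry time, before anyone has to read the ambiguous form.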

Despite advances in text processing software, no computational algorithms now exist that can accurately expand polysemous abbreviations from their sentence context. Polysemous abbreviations (abbreviations with alternate expansions) must be accompanied by their correct expansions in order to be understood correctly. Text markup languages (HTML, XML, RDF) all support this kind of annotation. In HTML, there is even a designated tag just for abbreviations:

http://www.w3schools.com/tags/tag_abbr.asp

Reports can be viewed to "show tags" or "hide tags" for the convenience of readers.

These kinds of solutions should be easy to implement in EMRs (Electronic Medical Records).
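Here is a minimal sketch of the idea, assuming the person entering the report supplies the intended expansion for each abbreviation (the software does not guess). The function names and the sample report are hypothetical; the markup is the standard HTML abbr element with its title attribute.

```python
import html
import re

def annotate(text, expansions):
    """Wrap each listed abbreviation in an HTML <abbr> tag carrying its expansion."""
    for abbreviation, expansion in expansions.items():
        pattern = r"\b" + re.escape(abbreviation) + r"\b"
        tag = '<abbr title="{}">{}</abbr>'.format(html.escape(expansion), abbreviation)
        # A function replacement avoids any backslash handling in the tag text.
        text = re.sub(pattern, lambda match, tag=tag: tag, text)
    return text

def hide_tags(annotated):
    """Strip the <abbr> markup, leaving the plain report text."""
    return re.sub(r"</?abbr[^>]*>", "", annotated)

report = "History of MS since 1999."
tagged = annotate(report, {"MS": "multiple sclerosis"})
print(tagged)             # the "show tags" view
print(hide_tags(tagged))  # the "hide tags" view
```

The expansion travels with the text, so a later reader (or program) never has to decide whether MS meant multiple sclerosis, mitral stenosis, or morphine sulfate.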

-Jules Berman

Tuesday, February 19, 2008

Python script adds an image description to a jpeg header

In September, I published a web document, with Bill Moore, that explained how images can be annotated with textual information that describes the image. The full document is distributed under a GNU license and is available at:

http://www.julesberman.info/spec2img.htm

Digital images that do not convey descriptions of their binary image have very little scientific value. The document has methods, in Perl and Ruby, for inserting textual information into image headers.

In today's blog, I show how textual information can be inserted into, and extracted from, a jpeg image using the Python programming language.

Chris Stromberger prepared two public domain Python scripts in October 2004:

http://www.fetidcascade.com/public/minimal_exif_writer.py

and

http://www.fetidcascade.com/public/minimal_exif_reader.py

These scripts use the exif header space, a popular method used by many manufacturers of digital cameras, to put textual information into the headers of digital images in popular formats, particularly jpeg.

http://www.exif.org/

Here is my script, exif.py, that uses both of Chris Stromberger's scripts to insert a description into a jpeg header and then to extract and display the text.


#!/usr/bin/python
import datetime
from minimal_exif_writer import MinimalExifWriter
from minimal_exif_reader import MinimalExifReader

# Build the description: the annotation date, then the contents of hamlet.txt
infile = open("hamlet.txt", "r")
text = "\nImage annotation date: "
text = text + str(datetime.date.today())
text = text + "\nImage description:\n"
text = text + infile.read()
infile.close()

# Remove any existing exif header, then write the new description and copyright
f = MinimalExifWriter('trial.jpg')
f.removeExif()
f.newImageDescription(text)
f.newCopyright('Dr. Sympatico', addYear = 1)
f.process()

# Read the header back and display what was stored
g = MinimalExifReader('trial.jpg')
print(g.imageDescription())
print(g.copyright())
print(g.dateTimeOriginal())

In this example, I built a string variable holding an image description composed of the contents of a small file (hamlet.txt) containing a few lines of Hamlet's soliloquy. Here is the output of exif.py


c:\>python exif.py
Image annotation date: 2008-02-19
Image description:
To be, or not to be--that is the question:
Whether 'tis nobler in the mind to suffer
The slings and arrows of outrageous fortune
Or to take arms against a sea of troubles
And by opposing end them. To die, to sleep--
No more--and by a sleep to say we end
The heartache, and the thousand natural shocks
That flesh is heir to. 'Tis a consummation
Devoutly to be wished. To die, to sleep--
To sleep--perchance to dream: ay,
there's the rub,

2008 Dr. Sympatico

Please back up your original images before using them with this script. Some digital camera manufacturers use a proprietary (non-standard) header that can be corrupted when the exif.py script tries to add data to an expected exif partition.
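
One way to make the backup step automatic is sketched below. The function name and the ".orig" suffix are my own arbitrary choices for illustration; the sketch only assumes that the image is an ordinary file on disk.

```python
import os
import shutil

def backup_image(path, suffix=".orig"):
    """Copy path to path + suffix, unless that backup already exists.

    Returns the path of the backup file. An existing backup is never
    overwritten, so the first (unmodified) copy of the image is kept.
    """
    backup_path = path + suffix
    if not os.path.exists(backup_path):
        # copy2 copies the file contents and also preserves timestamps
        shutil.copy2(path, backup_path)
    return backup_path

if __name__ == "__main__":
    if os.path.exists("trial.jpg"):
        print(backup_image("trial.jpg"))
```

Calling backup_image('trial.jpg') before the exif header is rewritten leaves an untouched copy to fall back on if the header is corrupted.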

Monday, February 18, 2008

Before you create a new standard....

Standards create problems. There are way too many of them, and closely-related standards are a source of confusion. Often, IP (intellectual property) encumbers standards. Standards often disappear, greatly inconveniencing adopters. "Big" standards tend to be dominated by powerful and wealthy corporations and often impose requirements that cannot reasonably be met by small companies or by individuals.

Before embarking on a new standard, committees should try to answer all of these questions:

1. Is there a pre-existing standard that covers the same technology?

2. If there is a pre-existing standard, can it be enhanced or modified to provide a desired functionality?

3. How much will it cost to develop the standard?

4. How long will the standards development process take?

5. Will the intended beneficiaries of the standard pay for the standards development process?

6. Who will develop the standard? Are the selected developers competent to produce an adequate standard?

7. Are any of the developers conflicted? Do they stand to profit if the standard is developed in a specific way?

8. Do any of the developers have proprietary software or data that they may wish to include in the standard?

9. Are the expected developers committed to work through the duration of the standards development process, and are they committed to providing all of the time and energy needed to develop the standard?

10. Will there be a mechanism whereby drafts of the standard are reviewed openly by the public? Will the minutes of the working committee be made public? Will public comments be used to modify successive drafts of the standard?

11. Will the standard have dependencies on other standards? If so, are there intellectual property issues that must be resolved before development begins? Will these issues require licenses or royalty agreements from the standards developers or the standards users?

12. Once created, is the standard likely to be adopted? Is the anticipated standard easily implemented?

13. Who will be the adopters of the standard? Are the expected standard adopters included in the development process for the standard?

14. Will the standard benefit a range of users beyond the standards developers?

15. What are the hazards that the standard may produce, and who might be hurt by the standard? In particular, will any entities be disadvantaged if they cannot readily adopt the standard?

16. Is it necessary to have the standard approved by an external organization?

17. If so, who will pay for the extra costs of obtaining approval from an external standards organization?

18. Will the standard need to be continuously updated and modified? Is there a planned process for producing multiple versions of the standard?

19. Is it really important to have the standard? Is it worth the effort?

Issues related to the development of new standards are discussed at length in my book, Biomedical Informatics.

-Jules Berman

My book, Principles of Big Data: Preparing, Sharing, and Analyzing Complex Information was published in 2013 by Morgan Kaufmann.

I urge you to explore my book. Google books has prepared a generous preview of the book contents.

tags: big data, metadata, data preparation, data analytics, data repurposing, datamining, data mining, medical standards

Sunday, February 17, 2008

Dangerous medical abbreviations

Almost all abbreviations have multiple different expansions. More often than not, it is easy for a human to disambiguate the meaning of an abbreviation that has alternate expansions. For example, you can distinguish AKA ("above knee amputation") from AKA ("also known as"). The context of a sentence determines the meaning.

However, there are many abbreviations that cannot easily be disambiguated, even by experts in a knowledge domain. These abbreviations sometimes arise from what I call "term-drift," wherein another, very similar term, with the same abbreviation, is mistakenly used, and where this misuse gains a foothold in medical culture.

Abbreviations that cannot always be disambiguated are particularly dangerous and are a potential source of medical errors. Here are some examples:

1. ABG aortic bifurcation graft, or aortobifemoral graft

2. AHA acquired hemolytic anemia, or autoimmune hemolytic anemia

3. ASCVD arteriosclerotic cardiovascular disease, or arteriosclerotic cerebrovascular disease

4. CHD congenital heart disease, or congestive heart disease, or coronary heart disease

5. DOA date of admission, or dead on arrival

6. EDC estimated date of conception, or estimated date of confinement ("due date" means almost the opposite of "conception date")

7. HZO herpes zoster ophthalmicus, or herpes zoster oticus

8. IBD inflammatory bowel disease, or irritable bowel disease

9. LLL left lower lid, or left lower lip, or left lower lobe, or left lower lung

10. MCGN mesangiocapillary glomerulonephritis, or minimal change glomerulonephritis

11. MVR mitral valve regurgitation, or mitral valve repair, or mitral valve replacement

12. NC no change, or noncontributory

13. NKDA no known drug allergies, or nonketotic diabetic acidosis

14. PE pulmonary effusion, or pulmonary edema, or pulmonary embolectomy, or pulmonary embolism

15. SK seborrheic keratosis, or solar keratosis

16. UVF ureterovaginal fistula, or urethrovaginal fistula
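
Software cannot choose the right expansion for these abbreviations, but it can at least flag them for a human reviewer. Here is a minimal Python sketch, using a few entries copied from the list above. The function name and the sample sentence are hypothetical, and a real table would be much longer.

```python
import re

# A few entries taken from the list above
AMBIGUOUS = {
    "CHD": ["congenital heart disease", "congestive heart disease",
            "coronary heart disease"],
    "DOA": ["date of admission", "dead on arrival"],
    "IBD": ["inflammatory bowel disease", "irritable bowel disease"],
    "MVR": ["mitral valve regurgitation", "mitral valve repair",
            "mitral valve replacement"],
}

def flag_ambiguous(text, table=AMBIGUOUS):
    """List each ambiguous abbreviation found in text, with its candidates.

    Returns (abbreviation, expansions) pairs in order of first appearance.
    No attempt is made to pick the correct expansion.
    """
    found = []
    seen = set()
    # Candidate abbreviations are runs of two or more capital letters
    for token in re.findall(r"\b[A-Z]{2,}\b", text):
        if token in table and token not in seen:
            seen.add(token)
            found.append((token, table[token]))
    return found

if __name__ == "__main__":
    for abbreviation, expansions in flag_ambiguous("Pt with CHD, s/p MVR."):
        print(abbreviation + ": " + " / ".join(expansions))
```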

I urge you to read more about my book. There's a generous preview of the book at the Google Books site. If you like the book, please request your librarian to purchase a copy of this book for your library or reading room.

- Jules J. Berman, Ph.D., M.D. tags: common disease, orphan disease, orphan drugs, rare disease, disease genetics, medical abbreviations, disambiguation, dangerous abbreviations, medical nomenclature, medical transcription, electronic health record, ehr, emr, medical errors, medical mistakes, medical terminology, ambiguous terminology, terminology pitfalls and confusing terminology

Saturday, February 16, 2008

Informatics issues related to consenting medical data

Identified medical records can be used for research if the patients have given informed consent for a specified use of the data.

Institutions conducting human subject research with consented records should be able to answer these informatics-related questions, many of which involve tracking transaction data:

1 Does each consent form have an identifier and a locator, a study number, and a data element indicating that the consent form itself was approved by an IRB?

2 If needed, could you put your hands on the physical consent document?

3 Does your database indicate the specific study for which consent was approved?

4 Was the consent form sufficiently detailed, allowing the patient to approve certain uses of specimens/data and decline other uses?

5 Is each consent tagged with tracking data?

6 Was the consent approved or declined?

7 What day was the consent signed?

8 Does the institution have a policy that applies to situations wherein a subject cannot provide an informed consent (e.g., infants, patients with dementia)?

9 If the institution has a policy of excluding certain classes of patient from providing informed consent, has the institution received approval for the policy from its IRB?

10 For children and challenged subjects, was the informed consent document signed by a surrogate?

11 For children and challenged subjects, how is it determined who may act as a surrogate, and how is the identity of the surrogate recorded and tracked?

12 Did the consenting subject change her mind and withdraw consent after consent had been approved?

13 If consent was withdrawn, what date did this occur?

14 If consent was withdrawn, was consent withdrawn for a particular use of a specimen/data, or for all purposes described by the consent document?

15 If consent was withdrawn, does the withdrawal of consent apply to more than one consent form?
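
Many of these questions can be answered only if the corresponding data elements are recorded for every consent form. As a rough sketch, assuming one record per consent form, a Python structure might look like the following. All of the field names are hypothetical, and the rule that an empty withdrawn_uses list means "withdrawn for all uses" is my own assumption for the example.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import List, Optional

@dataclass
class ConsentRecord:
    consent_id: str              # identifier of this consent form
    document_location: str       # where the physical document is kept
    study_number: str            # the specific study that was consented
    irb_approved_form: bool      # was the form itself approved by an IRB?
    approved: bool               # True if consent was given, False if declined
    date_signed: date
    permitted_uses: List[str] = field(default_factory=list)
    surrogate_name: Optional[str] = None     # set when a surrogate signed
    withdrawal_date: Optional[date] = None   # set when consent was withdrawn
    # Uses that were withdrawn; empty with a withdrawal date means all uses
    withdrawn_uses: List[str] = field(default_factory=list)

    def permits(self, use, on_date):
        """Was this use of the specimen/data consented on the given date?"""
        if not self.approved or use not in self.permitted_uses:
            return False
        if on_date < self.date_signed:
            return False
        if self.withdrawal_date is not None and on_date >= self.withdrawal_date:
            if not self.withdrawn_uses or use in self.withdrawn_uses:
                return False
        return True

if __name__ == "__main__":
    record = ConsentRecord("c-001", "cabinet 4", "S-12", True, True,
                           date(2008, 1, 10),
                           permitted_uses=["tissue banking", "genetic testing"])
    print(record.permits("tissue banking", date(2008, 2, 16)))
```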

This list was excerpted from my book, Biomedical Informatics.

- Jules Berman tags: informed consent

Friday, February 15, 2008

How are diseases named?

Is there a general rule for naming human diseases? No. Here is a list of some of the many ways by which diseases get their names.

1 As an expression of a characteristic pathologic process (e.g., muscular dystrophy)

2 For the physical agent that produced the disease (e.g., plumbism)

3 For a group of people who were at high risk for the disease (e.g., Legionnaires' Disease, named after a group of conventioneers who succumbed in an early outbreak)

4 For a molecule found in diseased cells (e.g. amyloidosis, prion disease)

5 For a geographic region in which occurrences of the disease are concentrated (e.g., Tangier Disease from Tangier Island, Virginia)

6 For the geographic spot from which a widespread epidemic emanated (e.g., Lyme disease from Lyme, Connecticut)

7 For a striking clinical feature of the disease (e.g., sleeping sickness)

8 As a crude and insensitive comparison to a non-human object (e.g., gargoylism, ichthyosis with confetti, happy puppet syndrome)

9 As a literary metaphor (e.g., Pickwickian syndrome, Mad Hatter's disease, Alice in Wonderland syndrome, Job's syndrome)

10 For a striking morphologic feature (e.g., sickle cell anemia)

11 For a patient who had the disease (e.g., Lou Gehrig disease)

12 For a physician or scientist who treated, described or researched the disease (e.g., Hodgkin disease, Cushing disease, Kaposi sarcoma)

13 As a witty but unhelpful acronym (e.g., CATCH 22 = cardiac abnormality, abnormal facies, t-cell deficit due to thymic hypoplasia, cleft palate, hypocalcemia resulting from a deletion on chromosome 22)

14 As a trope or descriptive metaphor from any existing language (e.g., Moyamoya disease derives from "moyamoya" meaning "puff of smoke" in Japanese, for the characteristic tangle of tiny cerebral vessels seen on x-ray)

15 As a token of Greek or Latin scholarship (e.g., pityriasis lichenoides et varioliformis acuta)

16 As a somewhat obscure and trivial fact that would be understandable only to experts (e.g., one and a half syndrome, which refers to a specific neurologic condition in which one eye acquires movement deficits, while the other eye acquires half of those deficits)

17 As inscrutable combinations of one or more of the above (e.g., the wistful-sounding "floating-harbor syndrome," named by combining the hospital in which one of the first cases appeared, Boston Floating Hospital, and a second hospital in which another case appeared, Harbor General Hospital in Torrance, California)

This list was taken from my book, Biomedical Informatics (List 7.3.1).

- Jules J. Berman, Ph.D., M.D. tags: common disease, orphan disease, orphan drugs, genetics of disease, disease genetics, rules of disease biology, rare disease, medical nomenclature, names of diseases, disease terminology, pathology, logophile, medical metaphor, medical terminology, pathologic process, pathophysiology, anatomic pathology, naming diseases, names of diseases, literary medicine, history of medicine

Thursday, February 14, 2008

The importance of having a FAST medical autocoder

In the past few blogs, I've been writing about medical autocoders.

The medical informatics literature has lots of descriptions of medical autocoders, but most of these descriptions fail to include the speed of the autocoders.

It's been my experience that most published autocoders work at about 500 bytes per second. If a surgical pathology report is 1000 bytes (and I expect that this is roughly the length of a surgical pathology report), a report would take about 2 seconds to autocode.

The autocoder that I wrote about in the past few blogs works at about 100 kilobytes per second (i.e. 1 megabyte of text in ten seconds). For code simplicity, I didn't use the doublet method for this autocoder, and I think had I done so, it would have coded at about 1 Megabyte of text per second in Perl or Ruby (even faster in Python).
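
The arithmetic behind these figures is easy to check. The short Python sketch below assumes decimal units (1 kilobyte = 1,000 bytes, 1 megabyte = 1,000,000 bytes); the function and constant names are made up for the illustration.

```python
def seconds_to_code(n_bytes, bytes_per_second):
    """Time needed to autocode n_bytes of text at a given throughput."""
    return float(n_bytes) / bytes_per_second

REPORT_BYTES = 1000       # rough size of one surgical pathology report
MEGABYTE = 1000 * 1000    # decimal megabyte (an assumption of this sketch)

RATES = [
    ("typical", 500),                 # bytes per second, published autocoders
    ("this blog", 100 * 1000),        # about 100 kilobytes per second
    ("doublet", 1000 * 1000),         # the author's estimate, 1 megabyte/s
]

for label, rate in RATES:
    print("%-10s %8.3f s per report %10.1f s per megabyte"
          % (label,
             seconds_to_code(REPORT_BYTES, rate),
             seconds_to_code(MEGABYTE, rate)))
```

At 500 bytes per second, a 1000-byte report takes 2 seconds; at 100 kilobytes per second, a megabyte of text takes 10 seconds, in agreement with the numbers quoted above.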

Why is it important to have a fast autocoder? Why can't you load your parser with a big file and let it run in the background, taking as long as it takes to finish?

There are three reasons why you absolutely must have a fast autocoder. I discuss these in my book, Biomedical Informatics, and I thought I'd address the issue in this blog.

1. Medical files today are large. It is not unusual for a large medical center to generate a terabyte of data each week. A slow autocoder could never keep up with the volume of medical information that is produced each day.

2. Autocoders, and the nomenclatures they draw terms from, need to be modified to accommodate unexpected oddities in the text that they parse (particularly formatting oddities and the inclusion of idiosyncratic language to express medical terms). The cycles of running a program, reviewing output, making modifications in software or nomenclatures, and repeating the whole process many times cannot be undertaken if you need to wait a week for your autocoding software to parse your text.

3. Autocoding is as much about re-coding as it is about the initial process of providing nomenclature codes.

You need to re-code (supply a new set of nomenclature codes for terms in your medical text) whenever you want to change from one nomenclature to another.

You need to re-code whenever you introduce a new version of a nomenclature.

You need to re-code whenever you want to use a new coding algorithm (e.g., parsimonious coding versus comprehensive coding, or linking a code to a particular extracted portion of a report).

You need to re-code whenever you add legacy data to your laboratory information systems.

You need to re-code whenever you merge different medical datasets (especially medical datasets that have been coded with different medical nomenclatures).

All of this re-coding adds to the data burden placed on a medical autocoder.

It has been my personal observation that computational tasks that take much time (more than a few seconds) tend to be put on the back burner. Many of the same observations apply to medical deidentification software. Smart informaticians understand that program execution speed is always very important.

- Jules Berman

Wednesday, February 13, 2008

Ruby, Perl and Python medical autocoders

In the past two days on this blog, I've provided very short, fast, and accurate medical autocoders in Ruby and Perl. I thought I might as well offer the equivalent Python script. The Python script runs about twice as fast as either the Ruby or the Perl script.

The Ruby, Perl and Python scripts and their equivalent output are provided at:

http://www.julesberman.info/coded.htm

They are distributed under a GNU license.

All three scripts use a public domain file of 20,000 PubMed Citations, available at:

http://www.julesberman.info/tumorabs.txt

They all use an external tumor nomenclature contained within the Neoplasm Classification and available as a gzipped XML file distributed under a GNU license at:

http://www.julesberman.info/neoclxml.gz

- Jules Berman

Tuesday, February 12, 2008

Medical autocoding with Perl

In yesterday's blog, I showed a short, simple Ruby script that can provide quick and accurate medical autocoding for medical free-text. I also provided a web site where you could inspect 20,000 PubMed abstract titles and the extracted/coded terms produced by the Ruby autocoder.

Today, I'm providing a web site with the equivalent Perl medical autocoder, along with the public domain output file of 20,000 autocoded PubMed abstracts. Surprisingly (to me) the Perl code executed at about the same speed as the Ruby code. Both autocoders would have significant speed gains if they used the doublet method (which I didn't use here because I wanted to demonstrate the shortest possible scripts). The Perl code is contained on the web page.

- Jules Berman

Monday, February 11, 2008

Fast, accurate medical autocoding with Ruby

In the field of biomedical informatics, it is often necessary to extract medical terms from text and attach a nomenclature concept code to the extracted term. By doing so, concepts of interest contained in text can be retrieved regardless of the choice of words used to describe a concept. For example, hepatocellular carcinoma, liver cell cancer, liver cancer, and hcc might all be given the same code number in a neoplasm nomenclature. Documents using any of these terms can be collected and merged if all of the terms are annotated with the same concept code.
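
To make the idea concrete, here is a small Python sketch in which four synonymous terms share one concept code, so that documents using any of them are collected together. The code value shown is a placeholder invented for this example (it is not the real Neoplasm Classification code for this concept), and the function name and sample documents are likewise hypothetical.

```python
import re
from collections import defaultdict

# Placeholder code, invented for this example only
HYPOTHETICAL_CODE = "C0000001"

TERM_TO_CODE = {
    "hepatocellular carcinoma": HYPOTHETICAL_CODE,
    "liver cell cancer": HYPOTHETICAL_CODE,
    "liver cancer": HYPOTHETICAL_CODE,
    "hcc": HYPOTHETICAL_CODE,
}

def documents_by_code(documents, term_to_code=TERM_TO_CODE):
    """Map each concept code to the documents that mention any of its terms."""
    grouped = defaultdict(list)
    for doc in documents:
        lowered = doc.lower()
        codes = set()
        for term, code in term_to_code.items():
            # Whole-word match, so "hcc" is not found inside a longer word
            if re.search(r"\b" + re.escape(term) + r"\b", lowered):
                codes.add(code)
        for code in sorted(codes):
            grouped[code].append(doc)
    return dict(grouped)

if __name__ == "__main__":
    docs = ["Resection of hepatocellular carcinoma",
            "Liver cancer screening",
            "HCC recurrence",
            "Renal cyst"]
    print(documents_by_code(docs))
```

The first three documents are retrieved under the same code even though no two of them use the same words for the concept.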

Many people think that it is difficult to write autocoding software [that can parse text, extract terms, and code terms].

Many people think that it is impossible to write fast autocoding software. People accept autocoder speeds that code a typical pathology report at a rate of 1 report (about 1 kilobyte) per second.

Both of these notions are false. A superb autocoder can be written in a few dozen lines of Ruby code. This short coder is fast, coding 20,000 citations in about 21 seconds (on a 2.8 GHz desktop CPU with 512 Megabytes of RAM). This is a rate of about 100 kilobytes per second. A faster (but more complex) coder has been written by the author using the doublet method.

The output of the coder is virtually perfect. I have prepared a web file that permits anyone to browse through 20,000 abstract titles and inspect the named neoplasms in the abstract text that were coded by the Ruby script. It is available at:

http://www.julesberman.info/coded.htm

Doubters can autocode the same list of abstract titles to determine if they can write an autocoder that is as simple, fast or accurate as this short Ruby autocoder.

Here is my Ruby script. For more information about using Ruby for autocoding and for many other biomedical projects, you may want to read my Ruby book.

As with all of my scripts, the following disclaimer applies. This script is provided by its creator, Jules J. Berman, "as is", without warranty of any kind, expressed or implied, including but not limited to the warranties of merchantability, fitness for a particular purpose and noninfringement. In no event shall the author or copyright holder be liable for any claim, damages or other liability, whether in an action of contract, tort or otherwise, arising from, out of or in connection with the software or the use or other dealings in the software.

Note that this script requires two external files, neocl.xml, the neoplasm classification in XML format, available for download as a gzipped file from:

http://www.julesberman.info/neoclxml.gz

It also requires tumorabs.txt, available at:

http://www.julesberman.info/tumorabs.txt


#!/usr/bin/ruby
# Build a hash that maps each term in the Neoplasm Classification to its code
text = File.open("neocl.xml", "r")
literalhash = Hash.new
text.each do
    |line|
    # Skip lines that carry no concept code (C followed by seven digits)
    next if (line !~ /\"(C[0-9]{7})\"/)
    line =~ /\"(C[0-9]{7})\"/
    code = $1;
    # The term is the text between "> and </
    line =~ /\"\> ?(.+) ?\<\//
    phrase = $1;
    if (phrase =~ /[a-z]/) 
        literalhash[phrase] = code
    end
end
text.close
puts "Neoplasm code hash has been created.  Autocoding will start now"
absfile = File.open("tumorabs.txt", "r")
outfile = File.open("tumorabs.out", "w")
absfile.each do
   |sentence|
   sentence.chomp!
   # Crude singularization, so that plural terms match the nomenclature
   sentence.gsub!(/omas/, "oma")
   sentence.gsub!(/tumo[u]?rs/, "tumor")
   outfile.puts "\nAbstract title..." + sentence.capitalize + "."
   sentence_array = sentence.split
   length = sentence_array.size
   # Test every run of consecutive words, starting at each word in turn
   length.times do
      (1..sentence_array.size).each do
         |place_length|
         phrase = sentence_array.slice(0,place_length).join(" ")
         if literalhash.has_key?(phrase)
            outfile.puts "Neoplasm term..." + phrase.capitalize + " " + literalhash[phrase]
         end 
      end
      sentence_array.shift
   end
end
absfile.close
outfile.close

Here is an example citation, followed by autocoded neoplasm terms.

Abstract title...Obstructive jaundice associated burkitt lymphoma mimicking pancreatic carcinoma.
Neoplasm term...Jaundice C0000000
Neoplasm term...Burkitt lymphoma C7188000
Neoplasm term...Lymphoma C7065000
Neoplasm term...Pancreatic carcinoma C3850000
Neoplasm term...Carcinoma C0000000

Note that only names of neoplasms are coded. Neoplasm-related terms (not strictly the name of any particular neoplasm) are captured with the general code, C0000000.
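
For readers who prefer Python, the same approach can be sketched as follows. This is my own port of the Ruby logic above for illustration, not the Python script posted at the coded.htm page. It works on lists of lines held in memory, and the XML lines in the usage example are a guess at the layout of neocl.xml, written only so that they match the regular expressions used by the Ruby script. The codes are the ones shown in the example output above.

```python
import re

CODE_PATTERN = re.compile(r'"(C[0-9]{7})"')
TERM_PATTERN = re.compile(r'"> ?(.+) ?</')

def build_term_hash(xml_lines):
    """Map each term found in the nomenclature lines to its concept code."""
    literalhash = {}
    for line in xml_lines:
        code_match = CODE_PATTERN.search(line)
        term_match = TERM_PATTERN.search(line)
        if not (code_match and term_match):
            continue
        phrase = term_match.group(1)
        if re.search(r"[a-z]", phrase):
            literalhash[phrase] = code_match.group(1)
    return literalhash

def autocode(sentence, literalhash):
    """Return (term, code) pairs for every run of words found in the hash."""
    sentence = sentence.rstrip("\n")
    # Crude singularization, as in the Ruby script
    sentence = sentence.replace("omas", "oma")
    sentence = re.sub(r"tumou?rs", "tumor", sentence)
    words = sentence.split()
    matches = []
    # Test every run of consecutive words, starting at each word in turn
    for start in range(len(words)):
        for end in range(start + 1, len(words) + 1):
            phrase = " ".join(words[start:end])
            if phrase in literalhash:
                matches.append((phrase, literalhash[phrase]))
    return matches

if __name__ == "__main__":
    # Assumed line layout; the real neocl.xml may differ in its details
    lines = ['<name code="C0000000">jaundice</name>',
             '<name code="C7188000">burkitt lymphoma</name>',
             '<name code="C7065000">lymphoma</name>',
             '<name code="C3850000">pancreatic carcinoma</name>',
             '<name code="C0000000">carcinoma</name>']
    title = ("obstructive jaundice associated burkitt lymphoma "
             "mimicking pancreatic carcinoma")
    for term, code in autocode(title, build_term_hash(lines)):
        print("Neoplasm term..." + term.capitalize() + " " + code)
```

Every candidate phrase costs one hash lookup, so the work per title depends on the number of words in the title, not on the size of the nomenclature.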

- Jules Berman

Sunday, February 10, 2008

Update of Neoplasm Classification

An update for the Neoplasm Classification, an open access document distributed under the GNU Free Documentation License, is now available as a gzipped XML file:

NEOCLXML.GZ 719,099 bytes

The latest version of the Neoplasm Classification contains over 146,400 different terms, of which 130,482 are classified names of neoplasms listed under 5,855 concepts.

An explanation of the classification is found in the beginning of the file.

This is the world's largest and most comprehensive listing of neoplasm names and is intended for use in biomedical informatics research and cancer research.

-Jules Berman tags: medical nomenclature, terminology

Correction to Ruby Programming for Medicine and Biology

Sorting the lines in large files is a frequent task for informaticians (e.g., sorting the lines from a long list of terms, with a different term on each line of the file).

For very large files, the built-in sort function of programming languages just cannot do the job because the lines are put into an array (held in memory), and even computers with lots of memory tend to choke.

An easy short-cut involves only sorting the first few characters of each line (10 characters in the script provided below), instead of the entire line. In this way, the array of lines from the file can be shortened to ten characters per line, and this saves lots of memory.

I provided a short Ruby script (bigsort.pl) in my book, Ruby Programming for Medicine and Biology. The bigsort.pl script on page 150 of Ruby Programming for Medicine and Biology has a few quirks. First, it assumes that the text file (to be sorted) is a DOS-style file with a two character (carriage-return, line-feed) linebreak. Also, it assumes that every line (in the file to be sorted) contains alphanumeric text.

Provided here is a minor modification to the bigsort.pl Ruby script. It should work for any type of text file and does not require text to appear on each line of the file that is being sorted. As with all my posted scripts, the method is provided by its creator, Jules J. Berman, "as is", without warranty of any kind, expressed or implied, including but not limited to the warranties of merchantability, fitness for a particular purpose and noninfringement. In no event shall the author or copyright holder be liable for any claim, damages or other liability, whether in an action of contract, tort or otherwise, arising from, out of or in connection with the software or the use or other dealings in the software.

Thanks go to Dr. Tim Rand, who spotted the error and sent me his own version of a fix on February 6, 2008.


#!/usr/local/bin/ruby
text = File.open("terms.txt", "r")
out = File.open("terms.put", "w")
linearray = Array.new
begin_position = 0
text.each_line do
    |line|
    old_position = begin_position
    begin_position = text.pos
    line = line.chomp + "          " #pad ten spaces (chomp! would return nil for a last line with no linebreak)
    linearray << line.slice(0..9) + old_position.to_s
end
linearray.sort!
linearray.each do
  |value|
  seekplace = value.slice(10..20).to_i
  text.seek(seekplace, IO::SEEK_SET)
  out.puts(text.readline)
end
exit
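
The same prefix-and-offset technique can be sketched in Python. This is an illustrative port, not a script from the book; the function name and its parameters are hypothetical. It keeps only the first few bytes of each line, plus the line's byte offset, in memory, sorts those short keys, and then seeks back into the file to copy out each full line.

```python
def sort_by_prefix(in_path, out_path, prefix_length=10):
    """Write the lines of in_path to out_path, ordered by line prefix.

    Only (prefix, offset) pairs are held in memory. Lines that share the
    same prefix keep their original file order, because the offset breaks
    ties; they are not sorted on the characters beyond the prefix.
    """
    keys = []
    with open(in_path, "rb") as source:
        while True:
            offset = source.tell()
            line = source.readline()
            if not line:
                break
            # Drop the linebreak (DOS or Unix style) before taking the prefix
            keys.append((line.rstrip(b"\r\n")[:prefix_length], offset))
        keys.sort()
        with open(out_path, "wb") as out:
            for _prefix, offset in keys:
                source.seek(offset)
                line = source.readline()
                # A final line without a linebreak is given one
                if not line.endswith(b"\n"):
                    line += b"\n"
                out.write(line)

if __name__ == "__main__":
    import os
    if os.path.exists("terms.txt"):
        sort_by_prefix("terms.txt", "terms.out")
```

Because the keys are (prefix, offset) pairs, no padding with spaces is needed, and empty lines are handled the same way as any other line.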

-Jules Berman