Tuesday, March 29, 2016

CLASS BLENDING: Simpson's Paradox

For the past two days, we've been posting on Class Blending. Simpson's paradox is a special case that demonstrates what may happen when classes of information are blended.

Simpson's paradox is a well-known problem for statisticians. The paradox is based on the observation that findings that apply to each of two data sets may be reversed when the two data sets are combined.

One of the most famous examples of Simpson's paradox was demonstrated in the 1973 Berkeley gender bias study. A preliminary review of admissions data indicated that women had a lower admissions rate than men:
         Number of applicants    Percent of applicants admitted
Men             8,442                         44%
Women           4,321                         35%
A nearly 10% difference is highly significant, but what does it mean? Was the admissions office guilty of gender bias?

A closer look at admissions department-by-department showed a very different story. Women were being admitted at higher rates than men, in almost every department. The department-by-department data seemed incompatible with the combined data.

The explanation was simple. Women tended to apply to the most popular and oversubscribed departments, such as English and History, which rejected a high proportion of their applicants. Men tended to apply to departments that the women of 1973 avoided, such as mathematics, engineering, and physics, where rejection rates were lower. Though women were on an equal or better footing with men within each department, the concentration of women applicants in the large, high-rejection departments accounted for the overall lower acceptance rate for women at Berkeley.
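
The reversal is easy to reproduce with a few lines of code. Here is a minimal Python sketch, using hypothetical department-level counts (not the actual Berkeley admissions figures), in which women are admitted at a higher rate than men in each department, yet at a lower rate overall:

# Hypothetical (applicants, admitted) counts for two departments
departments = {
    "Oversubscribed dept":  {"men": (100, 20),  "women": (800, 200)},
    "Undersubscribed dept": {"men": (800, 560), "women": (100, 75)},
}

totals = {"men": [0, 0], "women": [0, 0]}
for dept, groups in departments.items():
    for sex, (applied, admitted) in groups.items():
        totals[sex][0] += applied
        totals[sex][1] += admitted
        print("{0:22s} {1:6s} admitted: {2:5.1f}%".format(dept, sex, 100.0 * admitted / applied))

# Blend (combine) the two departments, and the apparent advantage reverses
for sex, (applied, admitted) in totals.items():
    print("{0:22s} {1:6s} admitted: {2:5.1f}%".format("Combined", sex, 100.0 * admitted / applied))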

Simpson's paradox demonstrates that data is not additive. It also shows us that data is not transitive; you cannot make inferences based on subset comparisons. For example, in randomized drug trials, you cannot assume that if drug A tests better than drug B, and drug B tests better than drug C, then drug A will test better than drug C (1). When drugs are tested, even in well-designed trials, the test populations are drawn from a general population specific to each trial. When you compare results from different trials, you can never be sure whether the different sets of subjects are comparable; each set may contain individuals whose responses to a third drug are unpredictable. Transitive inferences (i.e., if A is better than B, and B is better than C, then A is better than C) are unreliable.

- Jules Berman (copyrighted material)

key words: data science, irreproducible results, complexity, classification, ontology, ontologies, classifications, data simplification, jules j berman


[1] Baker SG, Kramer BS. The transitive fallacy for randomized trials: If A bests B and B bests C in separate trials, is A better than C? BMC Medical Research Methodology 2:13, 2002.

Monday, March 28, 2016

Open for Comment

For the past several years, I've kept this blog closed to comments. Prior to that, most of the comments were thinly disguised advertisements for pharmaceuticals, and I got tired of rejecting them. For the past month or so, I've re-opened the blog for readers' comments, without announcing the change; just to check whether I'd be inundated with computer-generated promotions.

It seems that neither software agents nor readers have noticed the change. So, please, if you are a human and would like to send a comment, feel free to use the comments link at the bottom of every post. Your comments will be moderated, but I intend to approve all comments written by humans that do not promote products, services, or other web sites. You are also welcome to comment on prior posts that appeared in the past month.

- Jules Berman

key words: blog, postings, comments, announcement, jules j berman

Sunday, March 27, 2016

Expunging a Blended Class: The Fall of Kingdom Protozoa

In yesterday's blog, we introduced and defined the term "Class blending". Today's blog extends this discussion by describing the most significant and most enduring class blending error to impact the natural sciences: the artifactual blending of all single cell organisms into the blended class, Protozoa.

For well over a century, biologists had a very simple way of organizing the eukaryotes (i.e., the organisms, other than bacteria, whose cells contain a nucleus) (1). Basically, the one-celled organisms were all lumped into one biological class, the protozoans (also called protists). With the exception of animals and plants, and some of the fungi (e.g., mushrooms), life on earth is unicellular. The idea of lumping every type of unicellular organism into one class, having shared properties, shared ancestry, and shared descendants, made no sense. What's more, the leading taxonomists of the nineteenth century, such as Ernst Haeckel (1834-1919), understood that the class Protozoa was, at best, a temporary grab-bag holding unrelated organisms that would eventually be split into their own classes. Well, a century passed, and complacent taxonomists preserved the Protozoan class. In the 1950s, Robert Whittaker elevated Class Protozoa to a kingdom in his broad new "Five Kingdom" classification of living organisms (2). This classification (more accurately, misclassification) persisted through the last five decades of the twentieth century.

Modern classifications, based on genetics, metabolic pathways, shared morphologic features, and evolutionary lineage, have dispensed with Class Protozoa, assigning each individual class of eukaryotes to its own hierarchical position. A simple schema demonstrates the modern classification of eukaryotes (3). Many modern taxonomists are busy improving this fluid list (vide infra), but, most significantly, Class Protozoa is nowhere to be found.
Eukaryota (organisms that have nucleated cells)
  Bikonta (2-flagella)
    Archaeplastida, from which Kingdom Plantae derives
Why is it important to expunge Class Protozoa from modern classifications of living organisms? Every class of living organism contains members that are pathogenic to other classes of organisms. To the point, most classes of organisms contain members that are pathogenic to humans, or to the organisms that humans depend on for their existence (e.g., other animals, food plants, beneficial organisms). There are far too many species of pathogens for us to develop specific drugs and techniques to control the growth of each disease-causing organism. Our only hope is to develop general treatments for classes of organisms that share the same properties, and hence the same weaknesses. For example, in theory, it's much easier to develop drugs that work on all Apicomplexans than it is to develop separate drugs that work on each pathogenic species of Apicomplexan (3).

By lumping every single-celled organism into one blended class, we have missed the opportunity to develop true class-based remedies for the most elusive disease-causing organisms on our planet. The past two decades have seen enormous progress in reclassifying the former protozoans. Unfortunately, the errors of the past are repeated in textbooks and dictionaries.

Here are three definitions of protozoa that I found on the web. Notice that these definitions don't even agree with one another. Notice that the first definition includes single celled organisms that may be free-living or parasitic. The second definition indicates that protozoans are obligate intracellular organisms. The third definition indicates that some protozoans are pathogenic in animals but omits mention of pathogenicity for other types of organisms. None of the definitions tell us that modern taxonomists have abandoned "protozoa" as a bona fide class of organisms.

from: http://www.dictionary.com/browse/protozoan
Protozoan: Any of a large group of one-celled organisms (called protists) that live in water or as parasites. Many protozoans move about by means of appendages known as cilia or flagella. Protozoans include the amoebas, flagellates, foraminiferans, and ciliates.

from: www.medicinenet.com/script/main/art.asp?articlekey=5091
Protozoa: A parasitic single-celled organism that can divide only within a host organism. For example, malaria is caused by the protozoa Plasmodium.

from: http://www.merriam-webster.com/dictionary/protozoan
Protozoan: any of a phylum or subkingdom (Protozoa) of chiefly motile and heterotrophic unicellular protists (as amoebas, trypanosomes, sporozoans, and paramecia) that are represented in almost every kind of habitat and include some pathogenic parasites of humans and domestic animals.


[1] Scamardella JM. Not plants or animals: a brief history of the origin of Kingdoms Protozoa, Protista and Protoctista. Internatl Microbiol 2:207-216, 1999.

[2] Hagen JB. Five kingdoms, more or less: Robert Whittaker and the broad classification of organisms. BioScience 62:67-74, 2012.

[3] Berman JJ. Taxonomic Guide to Infectious Diseases: Understanding the Biologic Classes of Pathogenic Organisms. Academic Press, Waltham, 2012.

- Jules Berman (copyrighted material)

key words: data science, irreproducible results, complexity, classification, ontology, ontologies, protozoa, Apicomplexa, protists, protoctista, jules j berman

Saturday, March 26, 2016

Intro to Class Blending

I thought I'd devote the next few blogs to a concept that has gotten much less attention than it deserves: blended classes. Class blending lurks behind much of the irreproducibility in "Big Science" research, including clinical trials. It also is responsible for impeding progress in various disciplines of science, particularly the natural sciences, where classification is of utmost importance. We'll see that the scientific literature is rife with research of dubious quality, based on poorly designed classifications and blended classes.

For today, let's start with a definition and one example. We'll discuss many more specific examples in future blogs.

Blended class - Also known as class noise, this term subsumes the more familiar, but less precise, term "labeling error." Blended class refers to inaccuracies (e.g., misleading results) introduced in the analysis of data due to errors in class assignments (i.e., assigning a data object to class A when the object should have been assigned to class B). If you are testing the effectiveness of an antibiotic on a class of people with bacterial pneumonia, the accuracy of your results will be forfeit when your study population includes subjects with viral pneumonia, or smoking-related lung damage. Errors induced by blending classes are often overlooked by data analysts who incorrectly assume that the experiment was designed to ensure that each data group is composed of a uniform and representative population. A common source of class blending occurs when the classification upon which the experiment is designed is itself blended. For example, imagine that you are a cancer researcher and you want to perform a study of patients with malignant fibrous histiocytomas (MFH), comparing the clinical course of these patients with the clinical course of patients who have other types of tumors. Let's imagine that the class of tumors known as MFH does not actually exist; that it is a grab-bag term erroneously assigned to a variety of other tumors that happened to look similar to one another. This being the case, it would be impossible to produce any valid results based on a study of patients diagnosed as MFH. The results would be a biased and irreproducible cacophony of data collected across different, and undetermined, types of tumors. This specific example, of the blended MFH class of tumors, is selected from the real-life annals of tumor biology (1), (2).
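
The dilution caused by a blended study population is easy to simulate. Here is a minimal Python sketch, using invented recovery rates (an antibiotic that raises recovery in bacterial pneumonia from 50% to 80%, but does nothing for viral pneumonia), showing how the apparent benefit of the treatment shrinks as misassigned subjects are blended into the treated class:

import random

# Hypothetical rates: the antibiotic yields 80% recovery in bacterial pneumonia,
# while viral pneumonia recovers at a 50% rate with or without the drug.
def blended_trial(fraction_misassigned, n=100000):
    recoveries = 0
    for i in range(n):
        if random.random() < fraction_misassigned:   # subject actually has viral pneumonia
            recovery_rate = 0.50
        else:                                         # subject has bacterial pneumonia
            recovery_rate = 0.80
        if random.random() < recovery_rate:
            recoveries += 1
    return 100.0 * recoveries / n

for fraction in (0.0, 0.25, 0.50):
    print("{0:.0f}% of treated subjects misassigned: apparent recovery rate {1:.1f}%".format(
        100 * fraction, blended_trial(fraction)))

As the proportion of misassigned subjects grows, the measured recovery rate drifts toward the untreated rate, even though the drug performs perfectly well on the class for which it was intended.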


[1] Al-Agha OM, Igbokwe AA. Malignant fibrous histiocytoma: between the past and the present. Arch Pathol Lab Med 132:1030-1035, 2008.

[2] Nakayama R, Nemoto T, Takahashi H, Ohta T, Kawai A, Seki K, et al. Gene expression analysis of soft tissue sarcomas: characterization and reclassification of malignant fibrous histiocytoma. Modern Pathology 20:749-759, 2007.

- Jules Berman (copyrighted material)

key words: data science, irreproducible results, complexity, classification, ontology, ontologies, jules j berman

Friday, March 25, 2016

Progress against cancer? Let's think about it.

It is difficult to pick up a newspaper these days without reading an article proclaiming progress in the field of cancer research. Here is an example, taken from an article posted on the MedicineNet site (1). The lead-off text is: "Statistics (released in 1997) show that cancer patients are living longer and even "beating" the disease. Information released at an AMA sponsored conference for science writers, showed that the death rate from the dreaded disease has decreased by three percent in the last few years. In the 1940s only one patient in four survived on the average. By the 1960s, that figure was up to one in three, and now has reached 50% survival."

Optimism is not confined to the lay press. In 2003, then-NCI Director Andrew von Eschenbach announced that the NCI intended to "eliminate death and suffering" from cancer by 2015 (2), (3). Update: it's 2016, and still no cancer cure.

Bullish assessments of progress against cancer are a bit misleading. There is ample historical data showing that the death rate from cancer rose throughout the twentieth century, and that the burden of new cancer cases will rise throughout the first half of the twenty-first century (4). If you confine your attention to the advanced common cancers (the cancers that cause the greatest number of deaths in humans), you find that the same common cancers that were responsible for the greatest numbers of deaths in 1950 are the same cancers killing us today, and at about the same rates (5), (6). Furthermore, the age-adjusted cancer death rate, the only valid measurement of progress against cancer, is about the same today as it was in 1950 (7). According to the U.S. National Center for Health Statistics, the age-adjusted cancer death rate in 1950 was 194 deaths per 100,000 population (8). In 2004, the death rate was the same: 194 per 100,000 population (8). Hardly an occasion for celebration.
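
For readers unfamiliar with the term, an age-adjusted rate simply weights the age-specific death rates by the age distribution of a fixed standard population, so that a rate from 1950 can be compared with a rate from 2004 without being distorted by the aging of the population. Here is a minimal Python sketch of direct age adjustment; the age-specific rates and the standard population proportions are invented for the example:

# Hypothetical age-specific cancer death rates (deaths per 100,000 per year)
age_specific_rates = {
    "0-19":  5.0,
    "20-49": 40.0,
    "50-69": 400.0,
    "70+":   1300.0,
}
# Hypothetical standard population: the proportion of persons in each age stratum
standard_population = {
    "0-19":  0.28,
    "20-49": 0.42,
    "50-69": 0.22,
    "70+":   0.08,
}
# Direct standardization: weight each age-specific rate by the standard proportion
age_adjusted_rate = sum(age_specific_rates[stratum] * standard_population[stratum]
                        for stratum in age_specific_rates)
print("Age-adjusted death rate: {0:.1f} per 100,000".format(age_adjusted_rate))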

In 1971, President Richard M. Nixon signed the National Cancer Act into law, marking the year that the United States launched its War on Cancer. For the next two decades, the U. S. cancer death rate rose steadily. Then in 1991, the U. S. cancer death rate began to decline, incrementally. It is tempting to conclude that 1991 marked the beginning of victory in our war against cancer, and that the steady, incremental declines in U. S. cancer death rates will continue in future decades, until cancer is fully eradicated. The decline in the cancer rate since 1991 is counter-balanced by the rise in the rate of cancer deaths between 1975 and 1991. What accounts for the rise in cancer deaths after 1975 and the restoration of the 1975 rates following 1991? There's no mystery here. The rise was due to smoking; the fall was due to smoking cessation (4). The post-1991 drop in the U.S. cancer death rate has only served to bring us full circle to our 1950 cancer death rate.

You may be thinking that cancer is a difficult problem, but at least the U.S. is working on the leading edge of cancer care. If cancer is a problem for us, it must be much worse for all the underdeveloped countries in the world. Nope. The U.S. has a high cancer death rate when compared to other countries (9). Kuwait, Panama, Ecuador, Mexico, and Thailand have much lower cancer death rates than the United States. American citizens intent on lowering their cancer death rate would be better off emigrating across the border, to Mexico, than waiting for the U.S. to win its war against cancer.

Despite the many billions of dollars spent on research and treatment for cancer, we have made negligible progress toward reducing the number of people who die each year from cancer. Cancer organizations can announce major gains against cancer, and can promise to eliminate cancer deaths by 2015, thanks entirely to the magic of data misinterpretation!

To see how the deception works in the cancer field, you need to start with the definition of "survival." To a layperson, the term "survival" indicates avoidance of death. For example, the survivors of a plane crash are the people who did not die in the crash. To an oncologist, survival is the time interval between diagnosis and death. Suppose that oncologists announce that a new treatment for pancreatic cancer produces a 1% increase in survival. A layperson will interpret this to mean that a person with pancreatic cancer will have a 1 in 100 chance of being cured of his cancer above and beyond his chances for cure with the older treatment. To most people with cancer, that 1 in 100 improvement, though small, is worth any price. Unfortunately, this is not the case at all. To the oncologists who made the announcement, a 1% increase in survival indicates that if the life expectancy following diagnosis of pancreatic cancer is 100 days, then the life expectancy following diagnosis with the new treatment is 101 days. In either case, most patients with advanced pancreatic cancer will die. The patients receiving the new treatment may reasonably expect to survive a bit longer (in this hypothetical case, an average of one day longer).

You may be asking yourself about the validity of claims that we can now cure many childhood cancers that could not be cured in prior generations. Thankfully, these claims are true and accurate. Many children with cancer can now be cured. However, the overall incidence of childhood cancers has risen 36% since 1976 (10). This rise in childhood cancer incidence has erased about half of the overall benefits from the rising cure rates.

Real progress has been made towards curing rare cancers, such as gastrointestinal stromal tumors (GISTs), chronic myelocytic leukemia, and Hodgkin Disease. There is a biological reason why the rare cancers are easier to cure than the common cancers, and this fascinating topic is discussed in detail in my book, Rare Diseases and Orphan Drugs: Keys to Understanding and Treating the Common Diseases. In a nutshell, research into the genetics of tumors has shown us that some cancers are characterized by simple genetic errors. It turns out that the tumors with simple genetic errors coincide with the rare tumors of childhood and certain rare tumors of adults. The small number of gene alterations in these rare tumors permits us to target chemotherapeutic agents against a single vulnerable metabolic pathway. Complex common cancers may share key metabolic pathways with simple rare cancers, but it will take a while before we can use this knowledge to develop effective treatments for the common cancers.

Cancer projections provided by the NCI's SEER program (the National Cancer Institute's Surveillance, Epidemiology, and End Results), indicate that between the years 2000 and 2050, the number of new cancer cases per year will more than double, from 1.3 million new cases in 2000 to 2.8 million new cases in 2050 (11). The projected yearly increase in cancer cases, if unchecked, will put additional strain on the wobbly American healthcare system.

After hundreds of billions of dollars were spent on cancer research and cancer treatment, with little to show for the effort, why did any of us believe that the dying would end by 2015? Humans live in hope; we would rather believe a hopeful lie than a hopeless truth.

- Jules Berman (copyrighted material)

key words: cancer, rare diseases, orphan diseases, cancer cure, cancer treatments, progress in cancer research, cancer statistics, jules j berman


[1] MedicineNet. Better and Longer Survival for Cancer Patients. Available from: http://www.medicinenet.com/script/main/art.asp?articlekey=157

[2] Kaiser J. NCI Goal Aims for Cancer Victory by 2015. Science 299:1297-1298, 2003.

[3] Eschenbach AC. NCI sets goal of eliminating suffering and death due to cancer by 2015. Journal of the National Medical Association 95:637-639, 2003.

[4] Berman JJ. Precancer: The Beginning and the End of Cancer. Jones and Bartlett, Sudbury, 2010.

[5] Bailar JC, Gornik HL. Cancer undefeated. N Engl J Med 336:1569-1574, 1997.

[6] Leaf C. Why We're Losing The War On Cancer: And How To Win It. Fortune Magazine, March 22, 2004.

[7] Hoyert DL, Heron MP, Murphy SL, Kung H-C. Final Data for 2003. National Vital Statistics Report. 54:(13), April 19, 2006.

[8] Health, United States, 2004. National Center for Health Statistics, Hyattsville, Maryland, 2004.

[9] Ferlay J, Soerjomataram I, Ervik M, Dikshit R, Eser S, Mathers C, et al. GLOBOCAN 2012 v1.0, Cancer Incidence and Mortality Worldwide: IARC CancerBase No. 11. Lyon, France: International Agency for Research on Cancer, 2013.

[10] Ries LAG, Smith MA, Gurney JG, Linet M, Tamra T, Young JL, et al. Cancer Incidence and Survival among Children and Adolescents: United States SEER Program 1975-1995, National Cancer Institute, SEER Program. NIH Pub. No. 99-4649. Bethesda, MD, 1999.

[11] Hayat MJ, Howlader N, Reichman ME, Edwards BK. Cancer Statistics, Trends, and Multiple Primary Cancer Analyses from the Surveillance, Epidemiology, and End Results (SEER) Program. The Oncologist 12:20-37, 2007.

Thursday, March 24, 2016

Scientific Misconduct at Prestigious Research Centers

On January 23, 2009, the Office of Research Integrity made public its findings of scientific misconduct concerning a doctor who fabricated data for several grant projects funded by the NIH (1). The doctor was a former graduate student in the Department of Pathology, Harvard Medical School; a former research fellow and Instructor of Pathology at Brigham and Women's Hospital in Boston; a former postdoctoral fellow in the Department of Biology at the California Institute of Technology; and a former Associate Professor in the Department of Biology and the Center for Cancer Research at the Massachusetts Institute of Technology. He had worked on numerous NIH grants, and was found to have fabricated data supporting applications for five NIH grants. It is difficult to imagine a person better prepared for a life of scientific integrity. From his pre-doctoral training, through his post-doctoral research and his academic appointment, he was nurtured in the finest environments, by some of the most respected scientists on the planet.

The Office of Research Integrity makes its findings a matter of public record. You can visit their web site and read the individual reports of misconduct (2). Here are some findings of misconduct, issued by the Office of Research Integrity, involving prestigious universities:

1. "[A] research program coordinator in the Oncology Center, The Johns Hopkins University School of Medicine, engaged in scientific misconduct by fabricating patient interview data for a study of quality of life measures in cancer patients. Further, the same research program coordinator, "engaged in scientific misconduct by falsifying patient status data by failing to update the status of treated breast cancer patients and misrepresenting data from previous contacts as the updated status for a study." (3)

2. An Assistant Professor in the Department of Psychology at Harvard University was found to have fabricated data in a number of different experiments that were described in journal publications. The doctor retracted the published papers in a letter that included the following language: "because I improperly excluded some participants who should have been included in the analyses and that this exclusion affected the reported results. Moreover, the improper exclusion of data was solely my doing and was not contributed to or known by my coauthors." (4)

3. An Assistant Professor in the Yale University School of Medicine "committed scientific misconduct by plagiarizing and intentionally misrepresenting research in an application for Public Health Service (PHS) funded research supported by grant application 1 R24 RR05358-01" (5).

It is easy to find outrageous incidents that occur in the most respected universities and corporations (6), (7), (8), (9), (10), (11), (12), (13), (14), (15), (16). Rein in your astonishment. If you are the kind of person who is motivated by prestige, money and power, then you may gravitate to the places where prestige, money and power are found.

This headline appeared in today's issue of The Guardian, concerning a research scandal at Sweden's famed Karolinska Institute: "'Superstar doctor' fired from Swedish institute over research 'lies'" (17). A renowned scientist and surgeon had committed a variety of scientific frauds, which have prompted the retraction of prior "breakthrough" results. In a statement, the Karolinska indicated that the physician's contract would be rescinded for reasons that included apparent scientific negligence and the falsification of his CV. The full extent of the scientist's activities is covered in the article (17). I won't burden you with the sordid details, but The Guardian article is definitely worth reading.


[1] Office of Research Integrity Available from: http://grants.nih.gov/grants/guide/notice-files/NOT-OD-09-040.html Jan 23, 2009.

[2] Office of Research Integrity. Available from: http://ori.dhhs.gov

[3] Findings of Scientific Misconduct. NIH GUIDE, Volume 26, Number 15, May 9, 1997. Available from: http://grants.nih.gov/grants/guide/notice-files/not97-097.html

[4] Findings of Scientific Misconduct. NOT-OD-02-020. December 13, 2001. Available from: http://grants.nih.gov/grants/guide/notice-files/NOT-OD-02-020.html

[5] Findings of Scientific Misconduct. NIH GUIDE, Volume 24, Number 33, September 22, 1995. Available from: http://grants.nih.gov/grants/guide/notice-files/not95-208.html

[6] Hajra A, Collins FS. Structure of the leukemia-associated human CBFB gene. Genomics 26:571-579, 1995.

[7] Altman LK. Falsified data found in gene studies. The New York Times October 30, 1996.

[8] Findings of scientific misconduct. NIH Guide Volume 26, Number 23, July 18, 1997 Available from: http://grants.nih.gov/grants/guide/notice-files/not97-151.html

[9] Bren L. Human Research Reinstated at Johns Hopkins, With Conditions. U.S. Food and Drug Administration, FDA Consumer magazine, September-October, 2001.

[10] Kolata G. Johns Hopkins Admits Fault in Fatal Experiment. The New York Times July 17, 2001.

[11] Brooks D. The Chosen: Getting in. The New York Times, November 6, 2005.

[12] Seward Z. MIT Admissions dean resigns; admits misleading school on credentials degrees from three colleges were fabricated, MIT says. Harvard Crimson, April 26, 2007.

[13] Available from: http://en.wikipedia.org/wiki/Jesse_Gelsinger. Comment. This Wikipedia article recounts the tragic death of Jesse Gelsinger, a volunteer in a gene therapy experiment at the University of Pennsylvania.

[14] Salmon A, Hawkes N. Clone 'hero' resigns after scandal over donor eggs. The Times, November 25, 2005.

[15] Wilson D. Harvard Medical School in Ethics Quandary. The New York Times March 3, 2009.

[16] Findings of Scientific Misconduct. NOT-OD-05-009. November 22, 2004. Available from: http://grants.nih.gov/grants/guide/notice-files/NOT-OD-05-009.html

[17] Oltermann P. 'Superstar doctor' fired from Swedish institute over research 'lies'. The Guardian March 24, 2016. Available at: https://www.theguardian.com/science/2016/mar/23/superstar-doctor-fired-from-swedish-institute-over-research-lies-allegations-windpipe-surgery, viewed March 24, 2016.

- Jules Berman (copyrighted material)

key words: ethics, fraud, scientific misconduct, ORI, Karolinska Institute, Machiavelli's Laboratory, Machiavelli's Lab, jules j berman

Tuesday, March 22, 2016

The Importance of Biological Taxonomy

Biological taxonomy is the scientific field dealing with the classification of living organisms. Non-biologists who give any thought to taxonomy may think that the field is the dullest of the sciences. To the uninitiated, there is little difference between the life of a taxonomist and the life of a stamp collector. Nothing could be further from the truth. Taxonomy has become the grand unifying theory of the biological sciences. Efforts to sequence the genomes of prokaryotic, eukaryotic and viral species, thereby comparing the genomes of different classes of organisms, have revitalized the field of evolutionary taxonomy (phylogenetics). The analysis of normal and abnormal homologous genes in related classes of organisms has inspired new disease treatments targeted against specific molecules and pathways characteristic of species or classes of organisms. Students who do not understand the principles of modern taxonomy have little chance of perceiving the connections among medicine, genetics, pharmacology, and pathology, to say nothing of clinical microbiology.

Here are two of the specific advantages of learning the taxonomy of infectious diseases.
1. As a method to drive down the complexity of medical microbiology

Learning all the infectious diseases of humans is an impossible task. As the number of chronically ill and immune-compromised patients has increased, so has the number of opportunistic pathogens. As global transportation has become commonplace, the number of exotic infections spread worldwide has also increased. A few decades ago, infectious disease experts were expected to learn a few hundred infectious diseases. Today, there are over 1400 organisms that can cause diseases in humans, and the number is climbing rapidly, while the techniques to diagnose and treat these organisms are constantly improving. Textbooks cannot cover all these organisms in sufficient detail to provide healthcare workers with the expertise to provide adequate care to their patients.

How can any clinician learn all that is needed to provide competent care to patients? The first step in understanding infectious diseases is to understand the classification of pathogenic organisms. Every known disease-causing organism has been assigned to one of 40 well-defined classes of organisms, and each class fits within a simple ancestral lineage. This means that every known pathogenic organism inherits certain properties from its ancestral classes and shares these properties with the other members of its own class. When you learn the class properties, along with some basic information about the infectious members of the classes, you gain a comprehensive understanding of medical microbiology.
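
The payoff of a hierarchical classification is that a property assigned to a class is inherited by every organism descending from that class. Here is a minimal Python sketch of the idea; the lineage fragment and the single class property shown here are illustrative simplifications, not an authoritative taxonomy:

# A fragment of a classification, expressed as a child-to-parent map (illustrative only)
parent = {
    "Plasmodium falciparum": "Apicomplexa",
    "Toxoplasma gondii":     "Apicomplexa",
    "Apicomplexa":           "Alveolata",
    "Alveolata":             "Eukaryota",
}
# A property attached to a class, shared by every member of the class (illustrative only)
class_property = {
    "Apicomplexa": "apicoplast organelle (a candidate drug target)",
}

def ancestry(organism):
    """Return the organism followed by its chain of ancestral classes."""
    lineage = [organism]
    while lineage[-1] in parent:
        lineage.append(parent[lineage[-1]])
    return lineage

for species in ("Plasmodium falciparum", "Toxoplasma gondii"):
    for taxon in ancestry(species):
        if taxon in class_property:
            print("{0} inherits: {1}".format(species, class_property[taxon]))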

2. As protection against professional obsolescence

There seems to be so much occurring in the biological sciences that it is just impossible to keep on top of things. With each passing day, you feel less in tune with modern science, and you wish you could return to a time when a few fundamental principles grounded your chosen discipline. You will be happy to learn that science is all about finding generalizations among data or among connected systems (i.e., reducing the complexity of data or finding simple explanations for systems of irreducible complexity). Much, if not all, of the perceived complexity of the biological sciences derives from the growing connections among once-separate disciplines: cell biology, ecology, evolution, climatology, molecular biology, pharmacology, genetics, computer science, paleontology, pathology, statistics, and so on. Scientists today must understand many different fields, and they must be willing and able to absorb additional disciplines throughout their careers. As each field of science becomes entangled with the others, the seemingly arcane field of biological taxonomy has gained prominence, because it occupies the intellectual core of virtually every biological field.

Modern biology is data-driven. A deluge of organism-based genomic, proteomic, metabolomic and other "omic" data is flooding our data banks and drowning our scientists. This data will have limited scientific value if we cannot find a way to generalize the data collected for each organism to the data collected in other organisms. Taxonomy is the scientific method that reveals how different organisms are related. Without taxonomy, data has no biological meaning.

The discoveries that scientists make in the future will come from questions that arise during the construction and refinement of biological taxonomy. In the case of infectious diseases, when we find a trait that informs us that what we thought was a single species is actually two species, it permits us to develop treatments optimized for each species, and to develop new methods to monitor and control the spread of both organisms. When we correctly group organisms within a common class, we can test and develop new drugs that are effective against all of the organisms within the class, particularly if those organisms are characterized by a molecule, pathway or trait that is specifically targeted by a drug. Terms used in diverse sciences, such as homology, metabolic pathway, target molecule, acquired resistance, developmental stage, cladistics, monophyly, model organism, class property, phylogeny, all derive their meaning and their utility from biological taxonomy. When you grasp the general organization of living organisms, you will understand how different scientific fields relate to each other, thus avoiding professional obsolescence.

- Jules Berman (copyrighted material)

key words: taxonomy, evolution, classification, data organization, jules j berman

Monday, March 21, 2016


Blog readers can use the discount code: COMP315 for a 30% discount, at checkout.

On March 17, my book Data Simplification: Taming Information with Open Source Tools was published by Morgan Kaufmann, an imprint of Elsevier. [The Elsevier site indicates that the book is still on preorder, but you can ignore that.] This past month, I've posted on topics relevant to data simplification. Beginning tomorrow, I'll be moving on to new subjects for this blog site, but I wanted to make one additional comment for anyone who might be on the fence about buying this book.

Most large data projects are total failures (1-21). Furthermore, in my humble opinion, most data projects that are deemed successes at the time of completion are actually failures of a kind, because the data that was collected during the project was abandoned when the project ended. Data shouldn't die. Data should be prepared in a manner that permits anyone (not just the people who planned the project) to confirm the conclusions, to reanalyze the data, to merge the data with other data sources, and to repurpose the data for future projects. To do so, the data must be prepared in a manner that is comprehensible and simplified. My book provides open source tools for creating data that can be used and repurposed, by generations of data scientists.

Enough said! Tomorrow, we move on.

- Jules Berman

key words: computer science, data analysis, data repurposing, data simplification, data science, information science, simplifying data, taming data, jules j berman


[1] Kappelman LA, McKeeman R, Lixuan Zhang L. Early warning signs of IT project failure: the dominant dozen. Information Systems Management 23:31-36, 2006.

[2] Arquilla J. The Pentagon's biggest boondoggles. The New York Times (Opinion Pages) March 12, 2011.

[3] Lohr S. Lessons From Britain's Health Information Technology Fiasco. The New York Times Sept. 27, 2011.

[4] Dismantling the NHS national programme for IT. Department of Health Media Centre Press Release. September 22, 2011. Available from: http://mediacentre.dh.gov.uk/2011/09/22/dismantling-the-nhs-national-programme-for-it/ viewed June 12, 2012.

[5] Whittaker Z. UK's delayed national health IT programme officially scrapped. ZDNet September 22, 2011.

[6] Lohr S. Google to end health records service after it fails to attract users. The New York Times Jun 24, 2011.

[7] An assessment of the impact of the NCI cancer Biomedical Informatics Grid (caBIG). Report of the Board of Scientific Advisors Ad Hoc Working Group, National Cancer Institute, March, 2011.

[8] Heeks R, Mundy D, Salazar A. Why health care information systems succeed or fail. Institute for Development Policy and Management, University of Manchester, June 1999 Available from: http://www.sed.manchester.ac.uk/idpm/research/publications/wp/igovernment/igov_wp09.htm, viewed July 12, 2012.

[9] Brooks FP. No silver bullet: essence and accidents of software engineering. Computer 20:10-19, 1987.

[10] Unreliable research: Trouble at the lab. The Economist October 19, 2013.

[11] Kolata G. Cancer fight: unclear tests for new drug. The New York Times April 19, 2010.

[12] Ioannidis JP. Why most published research findings are false. PLoS Med 2:e124, 2005.

[13] Baker M. Reproducibility crisis: Blame it on the antibodies. Nature 521:274-276, 2015.

[14] Naik G. Scientists' Elusive Goal: Reproducing Study Results. Wall Street Journal December 2, 2011.

[15] Innovation or Stagnation: Challenge and Opportunity on the Critical Path to New Medical Products. U.S. Department of Health and Human Services, Food and Drug Administration, 2004.

[16] Hurley D. Why Are So Few Blockbuster Drugs Invented Today? The New York Times November 13, 2014.

[17] Ioannidis JP. Microarrays and molecular research: noise discovery? The Lancet 365:454-455, 2005.

[18] Vlasic B. Toyota's slow awakening to a deadly problem. The New York Times, February 1, 2010.

[19] Lanier J. The complexity ceiling. In: Brockman J, ed. The next fifty years: science in the first half of the twenty-first century. Vintage, New York, pp 216-229, 2002.

[20] Labos C. It Ain't Necessarily So: Why Much of the Medical Literature Is Wrong. Medscape News and Perspectives. September 09, 2014

[21] Gilbert E, Strohminger N. We found only one-third of published psychology research is reliable - now what? The Conversation. August 27, 2015. Available at: http://theconversation.com/we-found-only-one-third-of-published-psychology-research-is-reliable-now-what-46596, viewed on August 27,2015.

Saturday, March 19, 2016


This is the last of my blogs related to topics selected from Data Simplification: Taming Information With Open Source Tools (released March, 2016). I hope that as you page back through my posts on Data Simplification topics, appearing throughout this month's blog, you'll find that this is a book worth reading.

Blog readers can use the discount code: COMP315 for a 30% discount, at checkout.

A file that big?
It might be very useful.
But now it is gone.

-Haiku by David J. Liszewski

Your scripts create data objects, and the data objects hold data. Sometimes, these data objects are transient, existing only during a block or subroutine. At other times, the data objects produced by scripts represent prodigious amounts of data, resulting from complex and time-consuming calculations. What happens to these data structures when the script finishes executing? Ordinarily, when a script stops, all the data produced by the script simply vanishes.

Persistence is the ability of data to outlive the program that produced it. The methods by which we create persistent data are sometimes referred to as marshalling or serializing. Some of the language-specific methods are called by such colorful names as data dumping, pickling, freezing/thawing, and storable/retrieve.

Data persistence can be ranked by level of sophistication. At the bottom is the exportation of data to a simple flat-file, wherein each record is one line in length, and each line record consists of a record key followed by a list of record attributes. The simple spreadsheet stores data as tab-delimited or comma-separated line records. Flat-files can contain a limitless number of line records, but spreadsheets are limited by the number of records they can import and manage. Scripts can be written that parse through flat-files line by line (i.e., record by record), selecting data as they go. Software programs that write data to flat-files achieve a crude but serviceable type of data persistence.
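
As a minimal illustration of the flat-file approach, the following Python sketch (the file name and the records are invented) writes a few key-attribute records to a flat-file, one line per record, and then parses the file back, record by record:

# Write a flat-file: one line per record, a record key followed by tab-delimited attributes
records = {
    "rec0001": ["Lucy Ricardo", "Star"],
    "rec0002": ["Fred Mertz", "Neighbor"],
}
with open("flatfile.txt", "w") as out:
    for key, attributes in records.items():
        out.write(key + "\t" + "\t".join(attributes) + "\n")

# Any later script can parse the flat-file line by line (i.e., record by record)
with open("flatfile.txt") as infile:
    for line in infile:
        fields = line.rstrip("\n").split("\t")
        print("key={0} attributes={1}".format(fields[0], fields[1:]))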

A middle-level technique for creating persistent data is the venerable database. If nothing else, databases are made to create, store, and retrieve data records. Scripts that have access to a database can achieve persistence by creating database records that accommodate data objects. When the script ends, the database persists, and the data objects can be fetched and reconstructed for use in future scripts.

Perhaps the highest level of data persistence is achieved when complex data objects are saved in toto. Flat-files and databases may not be suited to storing complex data objects, holding encapsulated data values. Most languages provide built-in methods for storing complex objects, and a number of languages designed to describe complex forms of data have been developed. Data description languages, such as YAML (Yet Another Markup Language) and JSON (JavaScript Object Notation) can be adopted by any programming language.
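
As a minimal illustration, the following Python sketch uses the standard json module to freeze a nested data structure (similar to the one used in the Perl example later in this post) into a JSON file, and then thaw it back into a data object; the file name is arbitrary:

import json

# A complex data object: a dictionary nesting a number, a string, a list, and another dictionary
complex_object = {
    "number": 42,
    "string": "This is a string",
    "array":  list(range(1, 11)),
    "hash":   {"apple": "red", "banana": "yellow"},
}

# Freeze (serialize) the object into a persistent external file
with open("dump_struct.json", "w") as out:
    json.dump(complex_object, out, indent=2)

# Thaw (deserialize) the object in a later script
with open("dump_struct.json") as infile:
    restored = json.load(infile)
print(restored["hash"]["banana"])    # prints: yellow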

Data persistence is essential to data simplification. Without data persistence, all data created by scripts is volatile, obliging data scientists to waste time recreating data that has ceased to exist. Essential tasks such as script debugging and data verification become impossible. It is worthwhile reviewing some of the techniques for data persistence that are readily accessible to Perl, Python and Ruby programmers.

Perl will dump any data structure into a persistent, external file, for later use. Here, the Perl script, data_dump.pl, creates a complex associative array, "%hash", which nests within itself a string, an integer, an array, and another associative array. This complex data structure is dumped into a persistent structure (i.e., an external file named dump_struct).
use Data::Dump qw(dump);
%hash = (
    number => 42,
    string => 'This is a string',
    array  => [ 1 .. 10 ],
    hash   => { apple => 'red', banana => 'yellow'},);
open(OUT, ">dump_struct");
print OUT dump \%hash;
The Perl script, data_slurp.pl picks up the external file, "dump_struct", created by the data_dump.pl script, and loads it into a variable.
use Data::Dump qw(dump);
open(IN, "dump_struct");
undef $/;            # slurp mode: read the entire file, not just the first line
$data = eval <IN>;
close(IN);
dump $data;
Here is the output of the data_slurp.pl script, in which the contents in the variable "$data" are dumped onto the output screen:
{
  array  => [1 .. 10],
  hash   => { apple => "red", banana => "yellow" },
  number => 42,
  string => "This is a string",
}
Python pickles its data. Here, the Python script, pickle_up.py, pickles a string variable
import pickle
pumpkin_color = "orange"
pickle.dump( pumpkin_color, open( "save.p", "wb" ) )
The Python script, pickle_down.py, loads the pickle file, "save.p", and prints its contents to the screen.
import pickle
pumpkin_color = pickle.load( open( "save.p", "rb" ) )
print(pumpkin_color)
The output of the pickle_down.py script is shown here:
orange
Where Python pickles, Ruby marshalls. In Ruby, whole objects, with their encapsulated data, are marshalled into an external file and demarshalled at will. Here is a short Ruby script, object_marshal.rb, that creates a new class, "Shoestring", a new class object, "loafer", and marshalls the new object into a persistent file, "output_file.per".

class Shoestring < String
  def initialize
    @object_uuid = (`c\:\\cygwin64\\bin\\uuidgen.exe`).chomp
  end
  def object_uuid
    print @object_uuid
  end
end

loafer = Shoestring.new
output = File.open("output_file.per", "wb")
output.write(Marshal.dump(loafer))
output.close
The script produces no output other than the binary file, "output_file.per". Notice that when we created the object, loafer, we included a method that encapsulates within the object a full uuid identifier, courtesy of cygwin's bundled utility, "uuidgen.exe".

We can demarshal the persistent "output_file.per" file, using the ruby script, object_demarshal.rb:

class Shoestring < String
  def initialize
    @object_uuid = `c\:\\cygwin64\\bin\\uuidgen.exe`.chomp
  end
  def object_uuid
    print @object_uuid
  end
end

array = []
File.open("output_file.per", "rb").each do |object|
  array << Marshal::load(object)
end
array.each do |object|
  puts object.object_uuid
  puts object.class
  puts object.class.superclass
end
The Ruby script, object_demarshal.rb, pulls the data object from the persistent file, "output_file.per" and directs Ruby to list the uuid for the object, the class of the object, and the superclass of the object.
Perl, Python and Ruby all have access to external database modules that can build database objects that exist as external files that persist after the script has executed. These database objects can be called from any script, with the contained data accessed quickly, with a simple command syntax (1).

Here is a Perl script, lucy.pl, that creates an associative array and ties it to an external database file, using the SDBM_File (Simple Database Management File) module.
use Fcntl;
use SDBM_File;
tie %lucy_hash, "SDBM_File", 'lucy', O_RDWR|O_CREAT|O_EXCL, 0644;
$lucy_hash{"Fred Mertz"} = "Neighbor";
$lucy_hash{"Ethel Mertz"} = "Neighbor";
$lucy_hash{"Lucy Ricardo"} = "Star";
$lucy_hash{"Ricky Ricardo"} = "Band leader";
untie %lucy_hash;
The lucy.pl script produces a persistent, external file, from which any Perl script can access the associative array created in the prior script. If we look in the directory from which the lucy.pl script was launched, we will find two new SDBM (Simple DataBase Manager) files, lucy.dir and lucy.pag. These are the persistent files that will substitute for the %lucy_hash associative array when invoked within other Perl scripts.

Here is a short Perl script, lucy_untie.pl, that extracts the persistent %lucy_hash associative array from the SDBM file in which it is stored:
use Fcntl;
use SDBM_File;
tie %lucy_hash, "SDBM_File", 'lucy', O_RDWR, 0644;
while(($key, $value) = each (%lucy_hash))
  {
  print "$key => $value\n";
  }
untie %lucy_hash;
Here is the output of the lucy_untie.pl script:
Fred Mertz => Neighbor
Ethel Mertz => Neighbor
Lucy Ricardo => Star
Ricky Ricardo => Band leader
Here is the Python script, lucy.py, that creates a tiny external database.
import dumbdbm
lucy_hash = dumbdbm.open('lucy', 'c')
lucy_hash["Fred Mertz"] = "Neighbor"
lucy_hash["Ethel Mertz"] = "Neighbor"
lucy_hash["Lucy Ricardo"] = "Star"
lucy_hash["Ricky Ricardo"] = "Band leader"
Here is the Python script, lucy_untie.py, that reads all of the key,value pairs held in the persistent database created for the lucy_hash dictionary object.
import dumbdbm
lucy_hash = dumbdbm.open('lucy')
for character in lucy_hash.keys():
  print character, lucy_hash[character]
Here is the output produced by the Python script, lucy_untie.py script.
Fred Mertz Neighbor
Ethel Mertz Neighbor
Lucy Ricardo Star
Ricky Ricardo Band leader
Ruby can also hold data in a persistent database, using the gdbm module. If you do not have the gdbm (GNU database manager) module installed in your Ruby distribution, you can install it as a Ruby GEM, using the following command line, from the system prompt:
c:\>gem install gdbm
The Ruby script, lucy.rb, creates an external database file, lucy.db:
require 'gdbm'
lucy_hash = GDBM.new("lucy.db")
lucy_hash["Fred Mertz"] = "Neighbor"
lucy_hash["Ethel Mertz"] = "Neighbor"
lucy_hash["Lucy Ricardo"] = "Star"
lucy_hash["Ricky Ricardo"] = "Band leader"
The Ruby script, lucy_untie.rb, reads the associative array stored in the persistent database, lucy.db:
require 'gdbm'
gdbm = GDBM.new("lucy.db")
gdbm.each_pair do |name, role|
  print "#{name}: #{role}\n"
end
The output from the lucy_untie.rb script is:
Ethel Mertz: Neighbor
Lucy Ricardo: Star
Ricky Ricardo: Band leader
Fred Mertz: Neighbor
Persistence is a simple and fundamental process, ensuring that data created by your scripts can be recalled by yourself or by others who need to verify your results. Regardless of the programming language you use, or the data structures you prefer, you will need to familiarize yourself with at least one data persistence technique.

- Jules Berman (copyrighted material)

key words: computer science, data science, data analysis, data simplification, simplifying data, persistence, databases, jules j berman


[1] Berman JJ. Methods in Medical Informatics: Fundamentals of Healthcare Programming in Perl, Python, and Ruby. Chapman and Hall, Boca Raton 2010.

Tuesday, March 15, 2016

DATA SIMPLIFICATION: The Many Uses of Random Number Generators

Over the next few weeks, I will be writing on topics related to my latest book, Data Simplification: Taming Information With Open Source Tools (release date March 17, 2016). I hope I can convince you that this is a book worth reading.

Blog readers can use the discount code: COMP315 for a 30% discount, at checkout.

If you are among the many students and professionals who are intimidated by statistics, then fear no more! With a little imagination, random number generators (to be accurate, pseudorandom number generators) can substitute for a wide range of statistical methods.

As it happens, modern computers can perform two simple processes, easily and very quickly. These two processes are: 1) generating random numbers, and 2) repeating sets of instructions thousands or millions of times. Using these two computational steps, we can accurately predict outcomes that would be intractable to any direct mathematical analysis. You are about to be rewarded with simple methods whereby every statistical test can be replicated and every probabilistic dilemma can be resolved; usually with a few lines of code (1-5).
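
As an illustration of the general approach, here is a minimal Python sketch of a permutation (resampling) test, in the spirit of references (1) and (2). The two small samples are invented; the script estimates how often a difference in means at least as large as the observed difference would arise if the group labels were shuffled at random:

import random

# Two invented samples; is the difference in their means plausibly due to chance?
group_a = [23, 25, 28, 31, 34]
group_b = [20, 21, 24, 26, 27]
observed_diff = sum(group_a) / float(len(group_a)) - sum(group_b) / float(len(group_b))

pooled = group_a + group_b
extreme_count = 0
trials = 100000
for i in range(trials):
    random.shuffle(pooled)                 # randomly reassign the group labels
    shuffled_a = pooled[:len(group_a)]
    shuffled_b = pooled[len(group_a):]
    diff = sum(shuffled_a) / float(len(shuffled_a)) - sum(shuffled_b) / float(len(shuffled_b))
    if diff >= observed_diff:
        extreme_count += 1
print("Estimated one-sided p-value: {0:.3f}".format(extreme_count / float(trials)))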

To begin, let's perform a few very simple simulations that confirm what we already know, intuitively. Imagine that you have a die, and you would like to know how often you might expect each of the numbers (from one to six) to appear over many repeated throws (5).

Let's simulate 600,000 throws of a die, using the Perl script, randtest.pl:
for $count (1..600000)
  {
  $one_of_six = (int(rand(6))+1);
  $hash{$one_of_six}++;     # tally this outcome
  }
while (($key, $value) = each (%hash))
  {
  print "$key => $value\n";
  }
The script, randtest.pl, begins by setting a loop that repeats 600,000 times, each repeat simulating the cast of a die. With each cast of the die, Perl generates a random integer, 1 through 6, simulating the outcome of a throw. The most important line of code is:
$one_of_six = (int(rand(6))+1);
The rand(6) command yields a pseudorandom number of value less than 6. We integerize the result using Perl's int() function, which truncates anything past the decimal point. This produces integer values of 0, 1, 2, 3, 4, or 5. We increment each value to produce 1, 2, 3, 4, 5, or 6. The script tallies the outcomes and yields the total number of die casts observed for each of the six possible results.

Here is the output of randtest.pl.
C:\ftp>perl randtest.pl
1 => 100002
2 => 99902
3 => 99997
4 => 100103
5 => 99926
6 => 100070
As one might expect, each of the six equally likely outcomes of a thrown die occurred about 100,000 times, in our simulation.

Repeating the randtest.pl script produces a different set of outcome numbers, but the general result is the same. Each die outcome had about the same number of occurrences.
C:\ftp>perl randtest.pl
1 => 100766
2 => 99515
3 => 100157
4 => 99570
5 => 100092
6 => 99900
Let's get a little more practice with random number generators before moving on to more challenging simulations. Occasionally in scripts, we need to create a new file, automatically, during the script's run time, and we want to be fairly certain that the file we create will not have the same filename as an existing file. An easy way of choosing a filename is to grab, at random, uppercase letters, concatenating them into a short string suitable as a filename. The chance that you'll encounter two files with the same randomly chosen filename is very remote. In fact, the odds against any two randomly chosen filenames being identical exceed 2 to the 44th power to one.

Here is a Perl script, random_filenames.pl, that assigns a sequence of 11 randomly chosen uppercase alphabetic characters to a file name:
for ($count = 1; $count <= 12; $count++)
  push(@listchar, chr(int(rand(26))+65));
$listchar[8]= ".";
$randomfilename = join("",@listchar);
print "Your filename is $randomfilename\n";
Here is the output of the random_filenames.pl script:
Your filename is OAOKSXAH.SIT
The key line of code in random_filenames.pl is:
push(@listchar, chr(int(rand(26))+65));
The rand(26) command yields a random value less than 26. The int() command converts the value to an integer. The number 65 is added to the value to produce a value ranging from 65 to 90, and the chr() command converts the numbers 65 through 90 to their ASCII equivalents, which just happen to be the uppercase alphabet from A to Z. The randomly chosen letter is pushed onto an array, and the process is repeated until a 12 character filename is generated.

Here is the equivalent Python script, random_filenames.py, that produces one random filename
import random
filename = [0]*12
filename = map(lambda x: x is "" or chr(int(random.uniform(0,25) + 65)), filename)
print ''.join(filename[0:8]) + "." + ''.join(filename[9:12])
The outcome of the random_filenames.py script is another randomly chosen filename, in the same format as the one produced by the Perl script.
In both of these scripts, as in all of the scripts in this section, many outcomes may result from a small set of initial conditions. It's much easier to write these programs and observe their outcomes than to directly calculate all the possible outcomes from a set of governing equations.

Let's use a random number generator to calculate the value of pi, without measuring anything, and without resorting to summing an infinite series of numbers. Here is a simple Python script, pi.py, that does the job.
import random
from math import sqrt
totr = 0
totsq = 0
for iterations in range(10000000):
  x= random.uniform(0,1)
  y= random.uniform(0,1)
  r= sqrt((x*x) + (y*y))
  if r < 1:
    totr = totr + 1
  totsq = totsq + 1
print float(totr)*4.0/float(totsq)
The outcome of the pi.py script is a number very close to the true value of pi (approximately 3.1416); the approximation improves as the number of iterations increases.
Here is an equivalent Ruby script:
x = y = totr = totsq = 0.0
(1..100000).each do
  x = rand()
  y = rand()
  r = Math.sqrt((x*x) + (y*y))
  totr = totr + 1 if r < 1
  totsq = totsq + 1
end
puts(totr * 4.0 / totsq)
Here is an equivalent Perl script:
open(DATA, ">pi.dat");    # holds the x,y points that fall inside the quarter-circle
for (1..10000)
  {
  $x = rand();
  $y = rand();
  $r = sqrt(($x*$x) + ($y*$y));
  if ($r < 1)
    {
    $totr = $totr + 1;
    print DATA "$x $y\n";
    }
  $totsq = $totsq + 1;
  }
close(DATA);
print eval(4 * ($totr / $totsq));
As one would hope, all three scripts produce approximately the same value of pi. The Perl script contains a few extra lines that produce an output file, named pi.dat, that will help us visualize how these scripts work. The pi.dat file contains the x,y data points, generated by the random number generator, that meet the "if" statement's condition that the hypotenuse of the x,y coordinates must be less than one (i.e., the point must fall within a circle of radius 1).

We can plot the output of the script with a few lines of Gnuplot code:
gnuplot> set size square
gnuplot> unset key
gnuplot> plot 'c:\ftp\pi.dat'
The resulting graph is a quarter-circle within a square.

(Figure: the data points produced by 10,000 random assignments of x and y coordinates in the range 0 to 1; randomly assigned data points whose hypotenuse exceeds 1 are excluded.)

The graph shows us, at a glance, how the ratio of the number of points in the quarter-circle to the total number of simulated points is related to the value of pi: the fraction of points falling within the quarter-circle approximates its area, pi/4, so multiplying that fraction by 4 yields an approximation of pi.

- Jules Berman (copyrighted material)

key words: computer science, data analysis, data repurposing, data simplification, simplifying data, random, pseudorandom, resampling, probability, simulations, Monte Carlo, jules j berman


[1] Simon JL. Resampling: The New Statistics. Second Edition, 1997. Available online at: http://www.resample.com/intro-text-online/, viewed on September 21, 2015.

[2] Efron B, Tibshirani RJ. An Introduction to the Bootstrap. CRC Press, Boca Raton, 1998.

[3] Diaconis P, Efron B. Computer-intensive methods in statistics. Scientific American, May, 116-130, 1983. Comment. Oft-cited explanation of resampling statistics, a field largely credited to Bradley Efron. The articles contains examples in Basic, Pascal, and Fortran source code.

[4] Anderson HL. Metropolis, Monte Carlo and the MANIAC. Los Alamos Science 14:96-108, 1986. Available at: http://library.lanl.gov/cgi-bin/getfile?00326886.pdf, viewed September 21, 2015.

[5] Berman JJ. Biomedical Informatics. Jones and Bartlett, Sudbury, MA, 2007.

Monday, March 14, 2016

DATA SIMPLIFICATION: Abbreviations and Acronyms

Over the next few weeks, I will be writing on topics related to my latest book, Data Simplification: Taming Information With Open Source Tools (release date March 17, 2016). I hope I can convince you that this is a book worth reading.

Blog readers can use the discount code: COMP315 for a 30% discount, at checkout.

"A synonym is a word you use when you can't spell the other one." -Baltasar Gracian

People confuse shortening with simplifying; a terrible mistake. In point of fact, next to reifying pronouns, abbreviations are the most vexing cause of complex and meaningless language. Before we tackle the complexities of abbreviations, let's define our terms. An abbreviation is a shortened form of a word or term. An acronym is an abbreviation composed of letters extracted from the words composing a multi-word term. There are two major types of abbreviations: universal/permanent and local/ephemeral. The universal/permanent abbreviations are recognized everywhere and have been used for decades (e.g., USA, DNA, UK). Some of the universal/permanent abbreviations ascend to the status of words whose long forms have been abandoned. For example, we use laser as a word; few who use the term know that "laser" is an acronym for "light amplification by stimulated emission of radiation". Local/ephemeral abbreviations are created for terms that are repeated within a particular document or a particular class of documents. Synonyms and plesionyms (i.e., near-synonyms) allow authors to represent a single concept using alternate terms (1).

Abbreviations make textual data complex, for three principal reasons:

1. No rules exist with which abbreviations can be logically expanded to their full-length form.

2. A single abbreviation may mean different things to different individuals, or to the same individual at different times.

3. A single term may have multiple different abbreviations. (In medicine, angioimmunoblastic lymphadenopathy can be abbreviated as ABL, AIL, or AIML.) Conversely, a single abbreviation may map to many expansions; these are the so-called polysemous abbreviations (See Glossary item, Polysemy). In the medical literature, a single abbreviation may have dozens of different expansions (1).

Some of the worst abbreviations fall into one of the following categories:

Abbreviations that are neither acronyms nor shortened forms of expansions. For example, the short form of "diagnosis" is "dx", although no "x" is contained therein. The same applies to the "x" in "tx", the abbreviation for "therapy", but not the "X" in "TX" that stands for Texas. For that matter, the short form of "times" is an "x", relating to the notation for the multiplication operator. Roman numerals I, V, X, L and M are abbreviations for words assigned to numbers, but they are not characters included in the expanded words (e.g., there is no "I" in "one"). EKG is the abbreviation for electrocardiogram, a word totally bereft of any "K". The "K" comes from the German orthography. There is no letter "q" in subcutaneous, but the abbreviation for the word is sometimes "subq"; never "subc". What form of alchemy converts ethanol to its common abbreviation, "EtOH"?

Mixed-form abbreviations. In medical lingo, "DSV" represents the Dermatome of the fifth (V) Sacral nerve. Here a preposition, an article, and a noun (of, the, nerve) have all been unceremoniously excluded from the abbreviation; the order of the acronym components has been transposed (dermatome sacral fifth); an ordinal has been changed to a cardinal (fifth changed to five); and the cardinal has been shortened to its Roman numeral equivalent (V).

Prepositions and articles arbitrarily retained in an acronym. When creating an abbreviation, should we retain or abandon prepositions? Many acronyms exclude prepositions and articles. USA is the acronym for United States of America; the "of" is ignored. DOB (Date Of Birth) remembers the "of".

Single expansions with multiple abbreviations. Just as abbreviations can map to many different expansions, the reverse can occur. For instance, high-grade squamous intraepithelial lesion can be abbreviated as HGSIL or HSIL. Xanthogranulomatous pyelonephritis can be abbreviated as xgp or xgpn.

Recursive abbreviations. The following example illustrates the horror of recursive abbreviations. The term SMETE is the abbreviation for the phrase "science, math, engineering, and technology education". NSDL is a real-life abbreviation, for "National SMETE digital Library community". To fully expand the term (i.e., to provide meaning to the abbreviation), you must recursively expand the embedded abbreviation, to produce "National science, math, engineering, and technology education digital Library community."
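
To see how an embedded abbreviation forces a recursive expansion, here is a minimal Python sketch; the two-entry expansion table is a hypothetical toy, and a real abbreviation list would be far larger.
# Minimal sketch of recursive abbreviation expansion, using a tiny,
# hypothetical expansion table.
expansions = {
    "SMETE": "science, math, engineering, and technology education",
    "NSDL": "National SMETE digital Library community",
}

def expand(term):
    # keep replacing known abbreviations until none remain
    for abbreviation, long_form in expansions.items():
        if abbreviation in term:
            return expand(term.replace(abbreviation, long_form))
    return term

print(expand("NSDL"))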

Stupid or purposefully unhelpful abbreviations. The term GNU ("GNU's Not Unix") is a recursive acronym. Fully expanded, this acronym is of infinite length. Although the N and the U expand to words ("Not Unix"), the letter G simply stands for the acronym itself. Another example of an inexplicable abbreviation is PT-LPD (post-transplantation lymphoproliferative disorders). The only logical location for a hyphen would be smack between the letters p and t. Is the hyphen situated between the T and the L for the sole purpose of irritating us?

Abbreviations that change from place to place. Americans sometimes forget that most English-speaking countries use British English. For example, an esophagus in New York is an oesophagus in London. Hence, TOF makes no sense as an abbreviation of tracheo-esophageal fistula here in the U.S., but this abbreviation makes perfect sense to physicians in England, where a patient may have a Tracheo-Oesophageal Fistula. The term GERD (representing the phrase gastroesophageal reflux disease) makes perfect sense to Americans, but it must be confusing in Britain, where the esophagus is not an organ.

Abbreviations masquerading as words. Our greatest vitriol is reserved for abbreviations that look just like common words. Some of the worst offenders come from the medical lexicon: axillary node dissection (AND), acute lymphocytic leukemia (ALL), Bornholm Eye Disease (BED), and Expired Air Resuscitation (EAR). Such acronyms complicate the computational task of confidently translating common words. Acronyms commonly appear as uppercase strings, but a review of a text corpus of medical notes has shown that words could not be consistently distinguished from homonymous word-acronyms (2).

Fatal abbreviations. Fatal abbreviations are those which can kill individuals if they are interpreted incorrectly. They all seem to originate in the world of medicine:

MVR, which can be expanded to any of: mitral valve regurgitation, mitral valve repair, or mitral valve replacement;

LLL, which can be expanded to any of: left lower lid, left lower lip, or left lower lung;

DOA, dead on arrival, date of arrival, date of admission, drug of abuse.

Is a fear of abbreviations rational, or does this fear emanate from an overactive imagination? In 2004, the Joint Commission on Accreditation of Healthcare Organizations, a stalwart institution not known to be squeamish, announced that, henceforth, a list of specified abbreviations should be excluded from medical records.

Examples of forbidden abbreviations are:

IU (International Unit), mistaken as IV (intravenous) or 10 (ten).

Q.D., Q.O.D. (Latin abbreviation for once daily and every other day), mistaken for each other.

Trailing zero (X.0 mg) or a lack of a leading zero (.X mg), in which cases the decimal point may be missed. Never write a zero by itself after a decimal point (X mg), and always use a zero before a decimal point (0.X mg).

MS, MSO4, MgSO4 all of which can be confused with one another and with morphine sulfate or magnesium sulfate. Write "morphine sulfate" or "magnesium sulfate."

Abbreviations on the hospital watch list were:

µg (for microgram), mistaken for mg (milligrams), resulting in a 1000-fold dosing overdose.

h.s., which can mean either half-strength or the Latin abbreviation for bedtime; the related q.h.s. (at bedtime) may be mistaken for "every hour." All can result in a dosing error.

T.I.W. (for three times a week), mistaken for three times a day or twice weekly, resulting in an overdose.

The list of abbreviations that can kill, in the medical setting, is quite lengthy. Fatal abbreviations probably arose through imprecise, inconsistent, or idiosyncratic uses of abbreviations by the busy hospital staff who enter notes and orders into patient charts. For any knowledge domain, the potentially fatal abbreviations are the most important to catch.
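
For a given knowledge domain, catching the dangerous abbreviations can be as simple as screening text against a watch list. Here is a minimal Python sketch; the watch list shown is a tiny, illustrative excerpt drawn from the abbreviations discussed above, not a complete or authoritative list.
# Minimal sketch: scan free text for abbreviations on a "do not use" watch list.
import re

watch_list = ["IU", "Q.D.", "Q.O.D.", "MS", "MSO4", "MgSO4", "h.s.", "T.I.W."]

def flag_forbidden(text):
    hits = []
    for abbreviation in watch_list:
        # match the abbreviation only when it is not embedded within a longer word
        if re.search(r'(?<![A-Za-z])' + re.escape(abbreviation) + r'(?![A-Za-z])', text):
            hits.append(abbreviation)
    return hits

print(flag_forbidden("Vitamin D 1000 IU Q.D. at h.s."))   # ['IU', 'Q.D.', 'h.s.']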

Nobody has ever found an accurate way of disambiguating and translating abbreviations (1). There are, however, a few simple suggestions, based on years of exasperating experience, that might save you time and energy.

1. Disallow the use of abbreviations, whenever possible. Abbreviations never enhance the value of information. The time saved by using an abbreviation is far exceeded by the time spent attempting to deduce its correct meaning.

2. When writing software applications that find and expand abbreviations, the output should list every known expansion of the abbreviation. For example, the abbreviation "ppp", appearing in a medical report, should have all of these expansions inserted into the text, as annotations: pancreatic polypeptide, palatopharyngoplasty, palmoplantar pustulosis, pentose phosphate pathway, platelet poor plasma, primary proliferative polycythaemia, primary proliferative polycythemia. Leave it up to the knowledge domain experts to disambiguate the results.
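
Here is a minimal Python sketch of such an annotator; the expansion table is a small, hypothetical excerpt, and a production tool would draw on a curated abbreviation list.
# Minimal sketch: insert every known expansion of an abbreviation into the text,
# as an annotation. The expansion table is a small, hypothetical excerpt.
import re

expansions = {
    "ppp": ["pancreatic polypeptide", "palatopharyngoplasty", "palmoplantar pustulosis",
            "pentose phosphate pathway", "platelet poor plasma",
            "primary proliferative polycythaemia", "primary proliferative polycythemia"],
}

def annotate(text):
    for abbreviation, long_forms in expansions.items():
        note = abbreviation + " [" + "; ".join(long_forms) + "]"
        # append the full list of candidate expansions after each occurrence
        text = re.sub(r'\b' + re.escape(abbreviation) + r'\b', note, text,
                      flags=re.IGNORECASE)
    return text

print(annotate("The serum ppp was not measured."))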

- Jules Berman (copyrighted material)

key words: computer science, data analysis, data repurposing, data simplification, simplifying data, abbreviations, acronyms, complexity, jules j berman


[1] Berman JJ. Pathology abbreviated: a long review of short terms. Arch Pathol Lab Med 128:347-352, 2004.

[2] Nadkarni P, Chen R, Brandt C. UMLS concept indexing for production databases. JAMIA 8:80-91, 2001.

Sunday, March 13, 2016


Over the next few weeks, I will be writing on topics related to my latest book, Data Simplification: Taming Information With Open Source Tools (release date March 17, 2016). I hope I can convince you that this is a book worth reading.

Blog readers can use the discount code: COMP315 for a 30% discount, at checkout.

Yesterday's blog covered lists of single words. Today we'll do doublets.

Doublet lists (lists of two-word terms that occur in common usage or in a body of text) are a highly underutilized resource. The special value of doublets is that single word terms tend to have multiple meanings, while doublets tend to have specific meaning.

Here are a few examples:

The word "rose" can mean the past tense of rise, or the flower. The doublet "rose garden" refers specifically to a place where the rose flower grows.

The word "lead" can mean a verb form of the infinitive, "to lead", or it can refer to the metal. The term "lead paint" has a different meaning than "lead violinist". Furthermore, every multiword term of length greater than two can be constructed with overlapping doublets, with each doublet having a specific meaning.

For example, "Lincoln Continental convertible" = "Lincoln Continental" + "Continental convertible". The three words, "Lincoln", "Continental", and "convertible" all have different meanings, under different circumstances. But the two doublets, "Lincoln Continental" and "Continental Convertible" would be unusual to encounter on their own, and produce a unique meaning, when combined.

Perusal of any nomenclature will reveal that most of the terms included in nomenclatures consist of two or more words. This is because single word terms often lack specificity. For example, in a nomenclature of recipes, you might expect to find, "Eggplant Parmesan" but you may be disappointed if you look for "Eggplant" or "Parmesan". In a taxonomy of neoplasms, available at: http://www.julesberman.info/figs/neocl_f.htm, containing over 120,000 terms, only a few hundred of those terms are single word terms (1).

Lists of doublets, collected from a corpus of text, or from a nomenclature, have a variety of uses in data simplification projects (1-3). We will show examples in Section 5.4, and in "On-the-fly indexing scripts" later in this chapter.

For now, you should know that compiling doublet lists, from any corpus of text, is extremely easy.

Here is a Perl script, doublet_maker.pl, that creates a list of alphabetized doublets occurring in any text file of your choice (filename.txt in this example):
open(TEXT, "filename.txt");
open(OUT, ">doubs.txt");
undef($/);                        # slurp the entire file into one string
$var = lc(<TEXT>);
$var =~ s/\n/ /g;                 # convert line breaks to spaces
$var =~ s/\'s//g;                 # drop possessives
$var =~ tr/a-zA-Z\'\- //cd;       # keep only letters, apostrophes, hyphens, spaces
@words = split(/ +/, $var);
foreach $thing (@words)
  {
  $doublet = "$oldthing $thing";
  if ($doublet =~ /^[a-z]+ [a-z]+$/)
    {
    $doublethash{$doublet} = "";
    }
  $oldthing = $thing;
  }
close TEXT;
@wordarray = sort(keys(%doublethash));
print OUT join("\n",@wordarray);
close OUT;
Here is an equivalent Python script, doublet_maker.py:
import re
in_file = open('filename.txt', "r")
out_file = open('doubs.txt', "w")
doubhash = {}
for line in in_file:
  line = line.lower()
  line = re.sub(r"[^a-z\'\- ]", ' ', line)   # replace punctuation and digits with spaces
  hoparray = line.split()
  for i in range(len(hoparray) - 1):
    doublet = hoparray[i] + " " + hoparray[i + 1]
    if doublet in doubhash:
      continue
    if re.search(r'^[a-z]+ [a-z]+$', doublet):
      doubhash[doublet] = ""
for doublet in sorted(doubhash.keys()):
  out_file.write(doublet + '\n')
out_file.close()
Here is an equivalent Ruby script, doublet_maker.rb that creates a doublet list from file filename.txt:
intext = File.open("filename.txt", "r")
outtext = File.open("doubs.txt", "w")
doubhash = Hash.new(0)
while record = intext.gets
  oldword = ""
  line_array = record.chomp.strip.downcase.split(/\s+/)
  line_array.each do |word|
    doublet = [oldword, word].join(" ")
    oldword = word
    next unless (doublet =~ /^[a-z]+\s[a-z]+$/)
    doubhash[doublet] = ""
  end
end
doubhash.keys.sort.each {|k| outtext.puts k}
I have deposited a public domain doublet list, available for download at:


The first few lines of the list are shown:
a bachelor
a background
a bacteremia
a bacteria
a bacterial
a bacterium
a bad
a balance
a balanced
a banana

- Jules Berman

key words: computer science, data analysis, data repurposing, data simplification, word lists, doublet lists, n-grams, complexity, open source tools, jules j berman


[1] Berman JJ. Automatic extraction of candidate nomenclature terms using the doublet method. BMC Medical Informatics and Decision Making 5:35, 2005.

[2] Berman JJ. Doublet method for very fast autocoding. BMC Med Inform Decis Mak, 4:16, 2004.

[3] Berman JJ. Nomenclature-based data retrieval without prior annotation: facilitating biomedical data integration with fast doublet matching. In Silico Biol, 5:0029, 2005. Available at: http://www.bioinfo.de/isb/2005/05/0029/, viewed on September 6, 2015.

Saturday, March 12, 2016


Over the next few weeks, I will be writing on topics related to my latest book, Data Simplification: Taming Information With Open Source Tools (release date March 23, 2016). I hope I can convince you that this is a book worth reading.

Blog readers can use the discount code: COMP315 for a 30% discount, at checkout.

Word lists, for just about any written language for which there is an electronic literature, are easy to create. Here is a short Python script, words.py, that prompts the user to enter a line of text. The script drops the line to lowercase, removes the carriage return at the end of the line, parses the result into an alphabetized list, removes duplicate terms from the list, and prints out the list, with one term assigned to each line of output. This words.py script can be easily modified to create word lists from plain-text files (See Glossary item, Metasyntactic variable).
import sys, re, string
print "Enter a line of text to be parsed into a word list"
line = sys.stdin.readline()
line = string.lower(line)
line = string.rstrip(line)
linearray = sorted(set(re.split(r' +', line)))
for i in range(0, len(linearray)):
  print linearray[i]
Here is a sample of output, when the input is the first line of Joyce's Finnegans Wake:
Enter a line of text to be parsed into a word list

a way a lone a last a loved a long the riverrun, past Eve and Adam's, from 
swerve of shore to bend of bay, brings us by a commodius vicus

Here is a nearly equivalent Perl script, words.pl, that creates a word list from a file. In this case, the chosen file happens to be "gettysbu.txt", containing the full text of the Gettysburg address. We could have included the name of any plain-text file.
open(TEXT, "gettysbu.txt");
undef($/);                      # slurp the entire file into one string
$var = lc(<TEXT>);
$var =~ s/\n/ /g;
$var =~ s/\'s//g;
$var =~ tr/a-zA-Z\'\- //cd;
@words = sort(split(/ +/, $var));
@words = grep($_ ne $prev && (($prev) = $_), @words);
print (join("\n",@words));
The words.pl script was designed for speed. You'll notice that it slurps the entire contents of the file into a string variable. If we were dealing with a very large file that exceeded the RAM limits of our computer, we would need to modify the script to parse through the file line-by-line.
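
Here is a minimal Python sketch of the line-by-line approach, assuming the same gettysbu.txt input file; only one line of the file is held in memory at any time.
# Minimal sketch: build a word list line-by-line, so that only one line of the
# input file needs to be held in memory at any time.
import re

words = set()
with open("gettysbu.txt") as in_file:
    for line in in_file:
        line = line.lower()
        line = re.sub(r"[^a-z'\- ]", " ", line)   # keep letters, apostrophes, hyphens
        words.update(line.split())
for word in sorted(words):
    print(word)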

Aside from word lists you create for yourself, there are a wide variety of specialized knowledge domain nomenclatures that are available to the public (1), (2), (3), (4), (5), (6). Linux distributions often bundle a wordlist, under filename "words", that is useful for parsing and natural language processing applications. A copy of the linux wordlist is available at:


Curated lists of terms, either generalized or restricted to a specific knowledge domain, are indispensable for a variety of applications (e.g., spell-checkers, natural language processors, machine translation, coding by term, indexing). Personally, I have spent an inexcusable amount of time creating my own lists, when no equivalent public domain resource was available.
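
As one small example of how such a list might be put to work, here is a minimal Python sketch of a crude spell-checker; it assumes a plain-text word list file named "words" (one term per line, such as the Linux wordlist mentioned above).
# Minimal sketch: flag tokens that do not appear in a reference word list.
import re

with open("words") as word_file:             # one word per line
    vocabulary = set(word.strip().lower() for word in word_file)

text = "The pateint was afebrile"
for token in re.findall(r"[a-z]+", text.lower()):
    if token not in vocabulary:
        print(token + " is not in the word list")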

- Jules Berman

key words: computer science, data analysis, data repurposing, data simplification, data wrangling, information science, simplifying data, taming data, complexity, system calls, Perl, Python, open source tools, utility, word lists, jules j berman


[1] Medical Subject Headings. U.S. National Library of Medicine. Available at: https://www.nlm.nih.gov/mesh/filelist.html, viewed on July 29, 2015.

[2] Berman JJ. A Tool for Sharing Annotated Research Data: the "Category 0" UMLS (Unified Medical Language System) Vocabularies. BMC Med Inform Decis Mak, 3:6, 2003.

[3] Berman JJ. Tumor taxonomy for the developmental lineage classification of neoplasms. BMC Cancer 4:88, 2004. Available at: http://www.biomedcentral.com/1471-2407/4/88, viewed on January 1, 2015.

[4] Hayes CF, O'Connor JC. English-Esperanto Dictionary. Review of Reviews Office, London, 1906. Available at: http://www.gutenberg.org/ebooks/16967, viewed on July 29, 2015.

[5] Sioutos N, de Coronado S, Haber MW, Hartel FW, Shaiu WL, Wright LW. NCI Thesaurus: a semantic model integrating cancer-related clinical and molecular information. J Biomed Inform 40:30-43, 2007.

[6] NCI Thesaurus. National Cancer Institute, U.S. National Institutes of Health, Bethesda, MD. Available at: ftp://ftp1.nci.nih.gov/pub/cacore/EVS/NCI_Thesaurus/ viewed on July 29, 2015.

Friday, March 11, 2016


Over the next few weeks, I will be writing on topics related to my latest book, Data Simplification: Taming Information With Open Source Tools (release date March 23, 2016). I hope I can convince you that this is a book worth reading.

Blog readers can use the discount code: COMP315 for a 30% discount, at checkout.

In yesterday's blog, I discussed using system calls within your scripts. One of my examples made a system call to ImageMagick. Today, I thought I'd describe ImageMagick and some of its benefits.

ImageMagick is an open source utility that supports a huge selection of robust and sophisticated image editing methods. Its source code download site is:


Users may find it convenient to download the executable binaries, for their specific operating system, from:


Hundreds of ImageMagick methods are described in detail, and useful examples are provided, at:


There are several things you should know about ImageMagick:

1. Unlike the commercial image processing applications, ImageMagick has no graphical user interface. ImageMagick is intended to serve as a command line utility.

2. ImageMagick is powerful. There are hundreds of available methods for creating and modifying images.

3. ImageMagick can be called from Python, Perl, Ruby, and other scripting languages, via system calls or via language-specific ImageMagick interface modules (e.g., PerlMagick, PythonMagick, or RMagick).

Here are a few examples of ImageMagick command lines that can be launched from the system prompt (which happens to be sitting at the c:\ftp subdirectory on my home computer):

Converts an image in SVG format to JPEG format.
c:\ftp>convert mtor_pathway.svg mtor_pathway.jpg
Creates a thumbnail image from image.jpg and converts it to gif format.
c:\ftp>convert -size 600x800 image.jpg -thumbnail 120x160 image_thumb.gif
Applies contrast twice to an image, and produces a new file for the result.
c:\ftp>convert original.jpg -contrast -contrast result.png
Puts the contents of file words.txt into the comment section of tar1.jpg and keeps the same filename for the output file.
c:\ftp>convert -comment @words.txt tar1.jpg tar1.jpg
Displays a verbose description of an image, including the header contents.
c:\ftp>identify -verbose tar1.jpg
The real power of ImageMagick comes when it is inserted into scripts, allowing the programmer to perform batch operations on collections of images.

Here is a Python script, image_resize.py, that creates a resized copy of every jpeg file in the current directory.
import os, re
filelist = os.listdir(".")
pattern = re.compile(r"\.jpg$")
for filename in filelist:
  if pattern.search(filename):
    out_filename = pattern.sub('_small.jpg', filename)
    cmdstring = "convert " + filename + " -resize 400x267! " + out_filename
    os.system(cmdstring)    # launch ImageMagick's convert utility
Here is a Perl script, factth.pl, that reduces the size of every image in a subdirectory that exclusively contains image files. In this case, the all-image subdirectory is "c:\ftp\factth".
$newdir = "c\:\\ftp\\factth";
opendir (MYDIR, $newdir) || die ("Can't open directory");
chdir ($newdir);
while ($file = readdir (MYDIR))
  {
  next if (-d $file);
  next if ($file eq "." || $file eq "..");
  system("convert $file -resize 30% $file");
  }
closedir (MYDIR);

- Jules Berman

key words: computer science, data analysis, data repurposing, data simplification, data wrangling, information science, simplifying data, taming data, complexity, system calls, Perl, Python, open source tools, utility, Image Magick, ImageMagick, jules j berman

Thursday, March 10, 2016


Over the next few weeks, I will be writing on topics related to my latest book, Data Simplification: Taming Information With Open Source Tools (release date March 23, 2016). I hope I can convince you that this is a book worth reading.

Blog readers can use the discount code: COMP315 for a 30% discount, at checkout.

A system call is a command line, inserted into a software program, that interrupts the script while the operating system executes the command line. Immediately afterwards, the script resumes at the next line. Any utility that runs from the command line can be embedded in any scripting language that supports system calls, and this includes all of the languages discussed in this book.

Here are the properties of system calls that make them useful to programmers:

1. System calls can be inserted into iterative loops (e.g., while loops, for loops), so that they can be repeated any number of times, on collections of files, or data elements.

2. Variables that are generated at run-time (i.e., during the execution of the script) can be included as arguments added to the system call.

3. The results of the system call can be returned to the script, and used as variables (a short sketch of this appears after this list).

4. System calls can utilize any operating system command and any program that would normally be invoked through a command line, including external scripts written in other programming languages. Hence, a system call can initiate an external script written in an alternate programming language, composed at run-time within the original script, using variables generated in the original script, and capturing the output from the external script for use in the original script!

System calls enhance the power of any programming language by providing access to a countless number of external methods and by participating in iterated actions using variables created at run-time.
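
Property 3, returning results to the calling script, deserves a short demonstration. Here is a minimal Python sketch that captures the output of a system command and uses it as an ordinary variable; the "dir" command is just an example, and on Linux a command such as "ls -l" would serve.
# Minimal sketch: capture the output of a system command for use within the script.
import subprocess

listing = subprocess.check_output("dir", shell=True)   # on Linux, try "ls -l"
line_count = len(listing.splitlines())
print("The directory listing contains " + str(line_count) + " lines")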

How does the system call help with the task of data simplification? Data simplification is very often focused on uniformity and reproducibility. If you have 100,000 images, data simplification might involve calling ImageMagick to resize every image to the same height and width. If you need to convert spreadsheet data to a set of triples, then you might need to provide a UUID string (see prior blog) for every triple in the database, all at once. If you are working on a Ruby project, and you need to apply one of Python's numpy methods to every data file in a large collection of data files, then you might want to create a short Python script that can be accessed, via a system call, from your Ruby script.

Once you have gotten the hang of including system calls in your scripts, you will probably use them in most of your data simplification tasks. It's important to know how system calls can be used to great advantage, in Perl, Python, and Ruby. A few examples follow.

The following short Perl script makes a system call, consisting of the DOS "dir" command:
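system("dir");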
The "dir" command, launched as a system call, displays the files in the current directory. Here is the equivalent script, in Python:
import os
os.system("dir")
Notice that system calls in Python require the importation of the os (operating system) module into the script.

Here is an example of a Ruby system call, to ImageMagick's "Identify" utility [note: this only works if you have pre-installed ImageMagick]. The system call instructs the "Identify" utility to provide a verbose description of the image file, 3320_out.jpg, and to pipe the output into the text file, myimage.txt.
system("Identify -verbose c:/ftp/3320_out.jpg >myimage.txt")
Here is an example of a Perl system call, to ImageMagick's "convert" utility, that incorporates a Perl variable ($file, in this case) that is passed to the system call [note: this only works if you have pre-installed ImageMagick].
$file = "try2.gif";
system("convert -size 350x40 xc:lightgray -font Arial -pointsize 32 -fill black
-gravity north -annotate +0+0 \"Hello, World\" $file");
The following Python script opens the current directory and parses through every filename, looking for jpeg image files. When a jpeg file is encountered, the script makes a system call to ImageMagick, instructing ImageMagick's "convert" utility to copy the jpeg file to the thumb drive (designated as the f: drive), in the form of a grayscale image. If you try this script at home, be advised that it requires a mounted thumb drive, in the "f:" drive [note: this only works if you have pre-installed ImageMagick].
import os
filelist = os.listdir(".")
for file in filelist:
  if ".jpg" in file:
    img_in = file
    img_out = "f:/" + file
    command = "convert " + img_in + " -set colorspace Gray -separate -average " + img_out
    os.system(command)    # launch ImageMagick's convert utility
Let's look at a Ruby script that calls a Perl script, a Python script, and another Ruby script.

Here are the Python, Perl, and Ruby scripts (hi.py, hi.pl, and hi.rb, respectively) that will be called from within a Ruby script:
print("Hi, I'm a Python script")

print "Hi, I'm a Perl script\n";

puts "Hi, I'm a Ruby script"
Here is the Ruby script, call_everyone.rb, that calls external scripts, written in Python, Perl and Ruby:
system("python hi.py")
system("perl hi.pl")
system("ruby hi.rb")
Here is the output of the Ruby script, call_everyone.rb:
Hi, I'm a Python script
Hi, I'm a Perl script
Hi, I'm a Ruby script
If you have some facility with a variety of language-specific methods and utilities, you can deploy them all from within your favorite scripting language.

- Jules Berman

key words: computer science, data analysis, data repurposing, data simplification, data wrangling, information science, simplifying data, taming data, complexity, system calls, Perl, Python, Ruby, jules j berman