Saturday, March 5, 2016

Data Simplification: Hitting the Complexity Barrier

Over the next few weeks, I will be writing on topics related to my latest book, Data Simplification: Taming Information With Open Source Tools (release date March 23, 2016). I hope I can convince you that this is a book worth reading.

Blog readers can use the discount code COMP315 at checkout for a 30% discount.

"Nobody goes there anymore. It's too crowded." -Yogi Berra

It seems that many scientific findings, particularly those findings based on analyses of large and complex data sets, are yielding irreproducible results. We find that we cannot depend on the data that we depend on. If you don't believe me, consider these shocking headlines:

1. "Unreliable research: Trouble at the lab." (1) In 2013, The Economist ran an article examining flawed biomedical research. The magazine article referred to an NIH official who indicated that "researchers would find it hard to reproduce at least three-quarters of all published biomedical findings, the public part of the process seems to have failed." The article described a study conducted at the pharmaceutical company Amgen, wherein 53 landmark studies were repeated. The Amgen scientists succeeded in reproducing the results of only 6 of the 53 studies. Another group, at Bayer HealthCare, repeated 63 studies. The Bayer group succeeded in reproducing the results of only one-fourth of the original studies.

2. "A decade of reversal: an analysis of 146 contradicted medical practices." (2) The authors reviewed 363 journal articles, reexamining established standards of medical care. Among these articles were 146 manuscripts (40.2%) claiming that an existing standard of care had no clinical value.

3. "Cancer fight: unclear tests for new drug." (3) This New York Times article examined whether a common test performed on breast cancer tissue (Her2) was repeatable. Among patients who initially tested positive for Her2, repeat testing indicated that 20% of the original positive assays were actually negative (i.e., falsely positive on the initial test).

4. "Reproducibility crisis: Blame it on the antibodies" (4). Biomarker developers are finding that they cannot rely on different batches of a reagent to react in a consistent manner, from test to test. Hence, laboratory analytic methods, developed using a controlled set of reagents, may not have any diagnostic value when applied by other laboratories, using different sets of the same analytes (4).

5. "Why most published research findings are false." (5). Modern scientists often search for small effect sizes, using a wide range of available analytic techniques, and a flexible interpretation of outcome results. The manuscript's author found that research conclusions are more likely to be false than true (5), (6).

6. "We found only one-third of published psychology research is reliable - now what?" (7). The manuscript authors suggest that the results of a first study should be considered preliminary and tentative. Conclusions have no value until they are independently validated.
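As a quick arithmetic check on the counts quoted in the headlines above, here is a minimal Python sketch. The helper name `percent` and the variable names are my own, chosen for illustration; the counts come from the cited articles as described in the text.

```python
def percent(part, whole):
    """Return part/whole as a percentage, rounded to one decimal place."""
    return round(100 * part / whole, 1)

# Counts as quoted in the text above (variable names are illustrative).
amgen_rate = percent(6, 53)        # Amgen: 6 of 53 landmark studies reproduced
reversal_rate = percent(146, 363)  # 146 of 363 articles contradicted a standard of care

print(f"Amgen studies reproduced: {amgen_rate}%")
print(f"Articles contradicting a standard of care: {reversal_rate}%")
```

The second figure matches the 40.2% reported in the text. The first shows that the Amgen group reproduced roughly one study in nine.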

Anyone who attempts to stay current in the sciences soon learns that much of the published literature is irreproducible (8), and that almost anything published today might be retracted tomorrow. This appalling truth applies to some of the most respected and trusted laboratories in the world (9), (10), (11), (12), (13), (14), (15), (16). Those of us who have been involved in assessing the rate of progress in disease research are painfully aware of the numerous reports indicating a general slowdown in medical progress (17), (18), (19), (20), (21), (22), (23), (24).

For the optimists, it is tempting to assume that the problems that we may be experiencing today are par for the course, and temporary. It is the nature of science to stall for a while and lurch forwards in sudden fits. Errors and retractions will always be with us so long as humans are involved in the scientific process.

For the pessimists, such as myself, there seems to be something going on that is really new and different; a game changer. This game changer is the "complexity barrier", a term credited to Boris Beizer, who used it to describe the impossibility of managing increasingly complex software products (25). The complexity barrier, known also as the complexity ceiling, reflects the intricacies of Big Science and applies to most of the data analysis efforts undertaken these days (26), (27).

Some of the mistakes that lead to erroneous conclusions in data-intensive research are well-known, and include the following:

1. Errors in sample selection, labeling, and measurement (28), (29), (30). For example, modern biomedical data is high-volume (e.g., gigabytes and larger), heterogeneous (i.e., derived from diverse sources), private (i.e., measured on human subjects), and multi-dimensional (e.g., containing thousands of different measurements for each data record). The complexities of handling such data correctly are daunting (31).

3. Misinterpretation of the data (32), (5), (33), (22), (34), (35), (36), (37).

4. Data hiding and data obfuscation (38), (39).

5. Unverified and unvalidated data (40), (41), (42), (43), (34), (44).

6. Outright fraud (39), (16), (45).

When errors occur in complex data analyses, they are notoriously difficult to discover (40).

Aside from human error, intrinsic properties of complex systems may thwart our best attempts at analysis. For example, when complex systems are perturbed from their normal, steady-state activities, the rules that govern the system's behavior become unpredictable (46). Much of the well-managed complexity of the world is found in machines built with precision parts having known functionality. For example, when an engineer designs a radio, she knows that she can assign names to components, and these components can be relied upon to behave in a manner that is characteristic of their type. A capacitor will behave like a capacitor, and a resistor will behave like a resistor. The engineer need not worry that the capacitor will behave like a semiconductor or an integrated circuit. The engineer knows that the function of a machine's component will never change; but the biologist operates in a world wherein components change their functions, from organism to organism, cell to cell, and moment to moment.

As an example, cancer researchers discovered an important protein that plays a role in the development of cancer. This protein, p53, was considered to be the primary cellular driver for human malignancy. When p53 mutated, cellular regulation was disrupted, and cells proceeded down a slippery path leading to cancer. In the past few decades, as more information was obtained, cancer researchers have learned that p53 is just one of many proteins that play some role in carcinogenesis, and that the role played by p53 changes depending on the species, tissue type, cellular microenvironment, genetic background of the cell, and many other factors. Under one set of circumstances, p53 may modify DNA repair; under another set of circumstances, p53 may cause cells to arrest the growth cycle (47), (48). It is difficult to predict the biological effect of a protein that changes its primary function based on prevailing cellular conditions.

At the heart of all data analysis is the assumption that systems have a behavior that can be described with a formula or a law, or that can lead to results that are repeatable and to conclusions that can be validated. We are now learning that our assumptions may have been wrong, and that our best efforts at data analysis may be irreproducible.

Science and society may have reached a complexity barrier beyond which nothing can be analyzed and understood with any confidence. In light of the irreproducibility of complex data analyses, it seems prudent to follow these two recommendations:

1. Simplify your complex data, before you attempt analysis.

2. Assume that the first analysis of primary data is tentative and probably wrong. The most important purpose of data analysis is to lay the groundwork for data reanalysis.


[1] Unreliable research: Trouble at the lab. The Economist October 19, 2013.

[2] Prasad V, Vandross A, Toomey C, Cheung M, Rho J, Quinn S, et al. A decade of reversal: an analysis of 146 contradicted medical practices. Mayo Clin Proc 88:790-8, 2013.

[3] Kolata G. Cancer fight: unclear tests for new drug. The New York Times April 19, 2010.

[4] Baker M. Reproducibility crisis: Blame it on the antibodies. Nature 521:274-276, 2015.

[5] Ioannidis JP. Why most published research findings are false. PLoS Med 2:e124, 2005.

[6] Labos C. It Ain't Necessarily So: Why Much of the Medical Literature Is Wrong. Medscape News and Perspectives. September 9, 2014.

[7] Gilbert E, Strohminger N. We found only one-third of published psychology research is reliable - now what? The Conversation. August 27, 2015. Viewed on August 27, 2015.

[8] Naik G. Scientists' Elusive Goal: Reproducing Study Results. Wall Street Journal December 2, 2011.

[9] Zimmer C. A sharp rise in retractions prompts calls for reform. The New York Times April 16, 2012.

[10] Altman LK. Falsified data found in gene studies. The New York Times October 30, 1996.

[11] Weaver D, Albanese C, Costantini F, Baltimore D. Retraction: altered repertoire of endogenous immunoglobulin gene expression in transgenic mice containing a rearranged mu heavy chain gene. Cell 65:536 (inclusive), 1991.

[12] Chang K. Nobel winner in physiology retracts two papers. The New York Times September 23, 2010.

[13] Fourth paper retracted at Potti's request. The Chronicle March 3, 2011.

[14] Whoriskey P. Doubts about Johns Hopkins research have gone unanswered, scientist says. The Washington Post March 11, 2013.

[15] Lin YY, Kiihl S, Suhail Y, Liu SY, Chou YH, Kuang Z, et al. Retraction: Functional dissection of lysine deacetylases reveals that HDAC1 and p300 regulate AMPK. Nature 482:251-255, retracted November, 2013.

[16] Shafer SL. Letter: To our readers. Anesthesia and Analgesia. February 20, 2009.

[17] Innovation or Stagnation: Challenge and Opportunity on the Critical Path to New Medical Products. U.S. Department of Health and Human Services, Food and Drug Administration, 2004.

[18] Hurley D. Why Are So Few Blockbuster Drugs Invented Today? The New York Times November 13, 2014.

[19] Angell M. The Truth About the Drug Companies. The New York Review of Books Vol 51, July 15, 2004.

[20] Crossing the Quality Chasm: A New Health System for the 21st Century. Quality of Health Care in America Committee, editors. Institute of Medicine, Washington, DC., 2001.

[21] Wurtman RJ, Bettiker RL. The slowing of treatment discovery, 1965-1995. Nat Med 2:5-6, 1996.

[22] Ioannidis JP. Microarrays and molecular research: noise discovery? The Lancet 365:454-455, 2005.

[23] Weigelt B, Reis-Filho JS. Molecular profiling currently offers no more than tumour morphology and basic immunohistochemistry. Breast Cancer Research 12:S5, 2010.

[24] Personalised medicines: hopes and realities. The Royal Society, London, 2005. Viewed Jan 1, 2015.

[25] Beizer B. Software Testing Techniques. Van Nostrand Reinhold, Hoboken, NJ, 2nd edition, 1990.

[26] Vlasic B. Toyota's slow awakening to a deadly problem. The New York Times, February 1, 2010.

[27] Lanier J. The complexity ceiling. In: Brockman J, ed. The next fifty years: science in the first half of the twenty-first century. Vintage, New York, pp 216-229, 2002.

[28] Bandelt H, Salas A. Contamination and sample mix-up can best explain some patterns of mtDNA instabilities in buccal cells and oral squamous cell carcinoma. BMC Cancer 9:113, 2009.

[29] Knight J. Agony for researchers as mix-up forces retraction of ecstasy study. Nature 425:109, September 11, 2003.

[30] Gerlinger M, Rowan AJ, Horswell S, Larkin J, Endesfelder D, Gronroos E, et al. Intratumor heterogeneity and branched evolution revealed by multiregion sequencing. N Engl J Med 366:883-892, 2012.

[31] Berman JJ. Biomedical Informatics. Jones and Bartlett, Sudbury, MA, 2007.

[32] Ioannidis JP. Is molecular profiling ready for use in clinical decision making? The Oncologist 12:301-311, 2007.

[33] Ioannidis JP. Some main problems eroding the credibility and relevance of randomized trials. Bull NYU Hosp Jt Dis 66:135-139, 2008.

[34] Ioannidis JP, Panagiotou OA. Comparison of effect sizes associated with biomarkers reported in highly cited individual articles and in subsequent meta-analyses. JAMA 305:2200-2210, 2011.

[35] Ioannidis JPA, Panagiotou OA. Comparison of effect sizes associated with biomarkers reported in highly cited individual articles and in subsequent meta-analyses. JAMA 305:2200-2210, 2011.

[36] Ioannidis JP. Excess significance bias in the literature on brain volume abnormalities. Arch Gen Psychiatry 68:773-780, 2011.

[37] Pocock SJ, Collier TJ, Dandreo KJ, deStavola BL, Goldman MB, Kalish LA, et al. Issues in the reporting of epidemiological studies: a survey of recent practice. BMJ 329:883, 2004.

[38] Harris G. Diabetes drug maker hid test data, files indicate. The New York Times July 12, 2010.

[39] Berman JJ. Machiavelli's Laboratory. Amazon Digital Services, Inc., 2010.

[40] Misconduct in science: an array of errors. The Economist. September 10, 2011.

[41] Begley S. In cancer science, many 'discoveries' don't hold up. Reuters Mar 28, 2012.

[42] Abu-Asab MS, Chaouchi M, Alesci S, Galli S, Laassri M, Cheema AK, et al. Biomarkers in the age of omics: time for a systems biology approach. OMICS 15:105-112, 2011.

[43] Moyer VA; on behalf of the U.S. Preventive Services Task Force. Screening for prostate cancer: U.S. Preventive Services Task Force recommendation statement. Ann Intern Med May 21, 2011.

[44] How science goes wrong. The Economist Oct 19, 2013.

[45] Martin B. Scientific fraud and the power structure of science. Prometheus 10:83-98, 1992.

[46] Rosen JM, Jordan CT. The increasing complexity of the cancer stem cell paradigm. Science 324:1670-1673, 2009.

[47] Madar S, Goldstein I, Rotter V. Did experimental biology die? Lessons from 30 years of p53 research. Cancer Res 69:6378-6380, 2009.

[48] Zilfou JT, Lowe SW. Tumor Suppressive Functions of p53. Cold Spring Harb Perspect Biol 00:a001883, 2009.

- Jules Berman

key words: computer science, data analysis, data repurposing, data simplification, data wrangling, information science, simplifying data, taming data, complexity, jules j berman