Friday, January 29, 2016

Misinterpretation of Results: The Most Pervasive Error in Data Science

The most common source of scientific error is post-analytic, arising from the interpretation of data (1), (2), (3), (4), (5), (6). Pre-analytic and analytic errors, though common, are encountered far less frequently than interpretation errors. Virtually every journal article contains, hidden in the introduction and discussion sections, some distortion of fact or misleading assertion. Scientists cannot be objective about their own work. As humans, we tend to interpret observations to reinforce our beliefs and prejudices and to advance our agendas.

One of the most common strategies by which scientists distort their own results is to contrive self-serving conclusions, a process called message framing (7). In message framing, a scientist draws his or her preferred conclusion, omitting from the discussion any pertinent findings that might diminish or discredit it. The common practice of message framing is conducted on a subconscious, or at least a sub-rational, level. A scientist is not apt to read articles whose conclusions contradict his or her own hypotheses and will not cite disputatious works. Furthermore, if a paradigm is held in high esteem by a majority of the scientists in a field, then works that contradict the paradigm are not likely to pass peer review. Hence, it is difficult for contrary articles to be published in scientific journals. In any case, the message delivered in a journal article is almost always framed in a manner that promotes the author's interpretation.

It must be noted that throughout human history, no scientist has ever gotten into any serious trouble for misinterpreting results. Scientific misconduct comes, as a rule, from the purposeful production of bad data, either through falsification or fabrication, or through the refusal to remove and retract data that is known to be false, plagiarized, or otherwise invalid. In the U.S., allegations of research misconduct are investigated by the Office of Research Integrity (ORI). Funding agencies in other countries have similar watchdog institutions. The ORI makes its findings a matter of public record (8). Of 150 cases investigated between 1993 and 1997, all but one had an alleged component of data falsification, fabrication, or plagiarism (9). In 2007, all 28 of the investigated cases involved allegations of falsification, fabrication, or both (10). No cases of misconduct based on data misinterpretation were prosecuted (11).

Post-analytic misinterpretation of data is hard-wired into the human psyche. Agencies tasked with ensuring scientific integrity have never seriously confronted the problem of data misinterpretation. Why would they? You can't fight human nature.

In December 2010, amidst much fanfare, NASA scientists announced that a new form of life had been found on Earth: a microorganism that thrived in the high concentrations of arsenic prevalent in Mono Lake, California. The microorganism was shown to incorporate arsenic into its DNA, in place of the phosphorus used by all other known terrestrial organisms. Thus, the newfound organism synthesized a previously unknown type of genetic material (12). NASA's associate administrator for the Science Mission Directorate at the time wrote, "The definition of life has just expanded." (13) The Director of the NASA Astrobiology Institute at the agency's Ames Research Center in Moffett Field, California, wrote, "Until now a life form using arsenic as a building block was only theoretical, but now we know such life exists in Mono Lake." (13)

Heady stuff! Soon thereafter, other scientists tried but failed to confirm the earlier findings (14). It seems that the new life form was just another old life form, and the arsenic was a hard-to-wash cellular contaminant (11). The best scientists on the planet cannot resist the lure of a scientific interpretation that promotes their own agenda.

The first analysis of data is usually wrong and irreproducible. Erroneous results and misleading conclusions are regularly published by some of the finest laboratories in the most prestigious institutions in the world (15), (16), (17), (18), (19), (20), (21), (22), (23), (24), (25), (26), (27). Every scientific study must be verified and validated, and the most effective way to ensure that verification and validation take place is to release your data for public review.
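To make the idea of verification concrete, here is a minimal sketch of the kind of check that public data makes possible: an independent reader re-computes a reported summary statistic directly from the released dataset and compares it against the published value. The file name, column name, and published figure are hypothetical placeholders, not values drawn from any study cited above.

# Minimal sketch: re-checking a published summary statistic against released data.
# The file name, column name, and published value are hypothetical placeholders.

import csv
import math

DATA_FILE = "released_measurements.csv"   # hypothetical publicly released data file
PUBLISHED_MEAN = 4.72                     # value reported in the hypothetical paper
TOLERANCE = 0.01                          # allowance for rounding in the publication

def column_mean(path, column):
    """Read one numeric column from a CSV file and return its mean."""
    values = []
    with open(path, newline="") as handle:
        for row in csv.DictReader(handle):
            values.append(float(row[column]))
    return sum(values) / len(values)

def main():
    recomputed = column_mean(DATA_FILE, "arsenic_ppb")
    if math.isclose(recomputed, PUBLISHED_MEAN, abs_tol=TOLERANCE):
        print("Verified: recomputed mean %.3f matches the published value." % recomputed)
    else:
        print("Discrepancy: recomputed mean %.3f differs from the published value %.3f."
              % (recomputed, PUBLISHED_MEAN))

if __name__ == "__main__":
    main()

The point of the sketch is not the arithmetic, which is trivial, but the fact that no such check is possible at all unless the underlying data have been released.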

References:

[1] Ioannidis JP. Is molecular profiling ready for use in clinical decision making? The Oncologist 12:301-311, 2007.

[2] Ioannidis JP. Why most published research findings are false. PLoS Med 2:e124, 2005.

[3] Ioannidis JP. Some main problems eroding the credibility and relevance of randomized trials. Bull NYU Hosp Jt Dis 66:135-139, 2008.

[4] Ioannidis JP. Microarrays and molecular research: noise discovery? The Lancet 365:454-455, 2005.

[5] Ioannidis JP, Panagiotou OA. Comparison of effect sizes associated with biomarkers reported in highly cited individual articles and in subsequent meta-analyses. JAMA 305:2200-2210, 2011.

[6] Berman JJ. Principles of Big Data: Preparing, Sharing, and Analyzing Complex Information. Morgan Kaufmann, Waltham, MA, 2013.

[7] Wilson JR. Rhetorical Strategies Used in the Reporting of Implantable Defibrillator Primary Prevention Trials. Am J Cardiol 107:1806-1811, 2011.

[8] Office of Research Integrity. Available from: http://ori.dhhs.gov

[9] Scientific Misconduct Investigations. 1993-1997. Office of Research Integrity, Office of Public Health and Science, Department of Health and Human Services, December, 1998.

[10] Office of Research Integrity Annual Report 2007, June 2008. Available from: http://ori.hhs.gov/images/ddblock/ori_annual_report_2007.pdf, viewed Jan. 1, 2015.

[11] Berman JJ. Repurposing Legacy Data: Innovative Case Studies. Morgan Kaufmann, Waltham, MA, 2015.

[12] Wolfe-Simon F, Switzer Blum J, Kulp TR, Gordon GW, Hoeft SE, Pett-Ridge J, et al. A Bacterium That Can Grow by Using Arsenic Instead of Phosphorus. Science 332:1163-1166, 2011.

[13] Discovery of "Arsenic-bug" Expands Definition of Life. NASA, December 2, 2010.

[14] Reaves ML, Sinha S, Rabinowitz JD, Kruglyak L, Redfield RJ. Absence of arsenate in DNA from arsenate-grown GFAJ-1 cells. Science 337:470-473, 2012.

[15] Knight J. Agony for researchers as mix-up forces retraction of ecstasy study. Nature 425:109, September 11, 2003.

[16] Hwang WS, Roh SI, Lee BC, Kang SK, Kwon DK, Kim S, et al. Patient-specific embryonic stem cells derived from human SCNT blastocysts. Science 308:1777-1783, 2005.

[17] Hajra A, Collins FS. Structure of the leukemia-associated human CBFB gene. Genomics 26:571-579, 1995.

[18] Altman LK. Falsified data found in gene studies. The New York Times October 30, 1996.

[19] Findings of scientific misconduct. NIH Guide Volume 26, Number 23, July 18, 1997 Available from: http://grants.nih.gov/grants/guide/notice-files/not97-151.html

[20] Bren L. Human Research Reinstated at Johns Hopkins, With Conditions. U.S. Food and Drug Administration, FDA Consumer magazine, September-October, 2001.

[21] Kolata G. Johns Hopkins Admits Fault in Fatal Experiment. The New York Times July 17, 2001.

[22] Brooks D. The Chosen: Getting in. The New York Times, November 6, 2005.

[23] Seward Z. MIT admissions dean resigns; admits misleading school on credentials; degrees from three colleges were fabricated, MIT says. Harvard Crimson, April 26, 2007.

[24] Salmon A, Hawkes N. Clone 'hero' resigns after scandal over donor eggs. The Times, November 25, 2005.

[25] Wilson D. Harvard Medical School in Ethics Quandary. The New York Times March 3, 2009.

[26] Findings of Scientific Misconduct. NOT-OD-05-009. November 22, 2004. Available from: http://grants.nih.gov/grants/guide/notice-files/NOT-OD-05-009.html

[27] Hajra A, Liu PP, Wang Q, Kelley CA, Stacy T, Adelstein RS, et al. The leukemic core binding factor β-smooth muscle myosin heavy chain (CBFβ-SMMHC) chimeric protein requires both CBFβ and myosin heavy chain domains for transformation of NIH 3T3 cells. Proc Natl Acad Sci USA 92:1926-1930, 1995.

- Jules Berman (copyrighted material)

key words: data analysis, data science, misinterpretation of results, distorting results, result bias, author bias, paradigm bias, data interpretation, jules j berman