Wednesday, March 2, 2016

Data Simplification: Why Bother?

"Order and simplification are the first steps toward the mastery of a subject." -Thomas Mann

Complex data is difficult to understand and analyze. Large data projects, using complex sets of data, are likely to fail; furthermore, the more money spent on a data project, the greater the likelihood of failure [1-9]. What is true for data projects is also true in the experimental sciences; large and complex projects are often unrepeatable [10-28]. Basically, complexity is something that humans have never mastered. As a species, we work much better when things are made simple.

Intelligent data scientists soon learn that it is nearly impossible to conduct competent and reproducible analyses of highly complex data. Inevitably, something goes wrong. My book, Data Simplification: Taming Information With Open Source Tools, was written to provide a set of general principles and methods for data simplification. Here are the points that establish its conceptual framework:

1) The first step in data analysis is data simplification. Most modern data scientists can expect that the majority of their time will be spent collecting, organizing, annotating, and simplifying their data, preparatory to analysis. A relatively small fraction of their time will be spent directly analyzing the data. When data has been simplified, successful analysis can proceed using standard computational methods, on standard computers.

2) Results obtained from unsimplified projects are nearly always irreproducible. Hence, the results of analyses on complex data cannot be validated. Conclusions that cannot be validated have no scientific value.

3) Data simplification is not simple. There is something self-defeating about the term "data simplification". The term seems to imply a dumbing-down process wherein naturally complex concepts are presented in a manner that is palatable to marginally competent scientists. Nothing could be further from the truth. Creating overly complex data has always been the default option for lazy-minded or cavalier scientists who lack the will or the talent to produce a simple, well-organized, and well-annotated collection of data. The act of data simplification will always be one of the most challenging tasks facing data scientists, often requiring talents drawn from multiple disciplines. The sad truth is that there are very few data professionals who are competent to perform data simplification, and fewer still educators who can adequately teach the subject.

4) No single software application will solve your data simplification needs [29]. Applications that claim to do everything for the user are, in most instances, applications that require the user to do everything for the application. The most useful software solutions often come in the form of open source utilities, designed to perform one method, very well, and very fast. In this book, dozens of freely available utilities are demonstrated.

5) Data simplification tools are data discovery tools, in the hands of creative individuals. The act of data simplification always gives the scientist a better understanding of the meaning of the data. Data that has been organized and annotated competently provides us with new questions, new hypotheses, and new approaches to problem solving. Data that is complex only provides headaches.

6) Data simplification is a prerequisite for data preservation. Data that has not been simplified has no useful shelf-life. After a data project has ended, nobody will be able to understand what was done. This means no future projects will build upon the original data, or find new purposes for the data. Moreover, conclusions drawn from the original data will never be verified or validated. This means that when you do not simplify your data, your colleagues will not accept your conclusions. Those who understand the principles and practice of data simplification will produce credible data that can be validated and repurposed.

7) Data simplification saves money. Data simplification often involves developing general solutions that apply to classes of data. By eliminating the cost of using made-to-order proprietary software, data scientists can increase their productivity and reduce their expenses.

8) Learning the methods and tools of data simplification is a great career move. Data simplification is the next big thing in the data sciences. The most thoughtful employers understand that it's not always about keeping it simple. More often, it's about making it simple.

9) Data scientists should have familiarity with more than one programming language. Although one high-level language has much the same functionality as another, each language may have particular advantages in different situations. For example, a programmer may prefer Perl when her tasks involve text parsing and string pattern matches. Another programmer might prefer Python if she requires a variety of numeric or analytic functions and a smooth interface to a graphing tool. Programmers who work with classes of data objects, or who need to model new classifications, might prefer the elegant syntax and rich class libraries available in Ruby. Books that draw on a single programming language run the risk of limiting the problem-solving options of their readers. Although there are many high-quality programming languages, I have chosen Perl, Python and Ruby as the demonstration languages for this book. Each of these popular languages is free, open source, and can be installed easily and quickly on virtually any operating system. By offering solutions in several different programming languages, this book may serve as a sort of Rosetta stone for data scientists who must work with data structures produced in different programming environments.

Over the next few weeks, I will be blogging on topics selected from Data Simplification: Taming Information With Open Source Tools. I hope I can convince you that this is a book worth reading.

- Jules Berman

key words: computer science, data analysis, data repurposing, data simplification, data wrangling, information science, simplifying data, taming data, jules j berman

Blog readers can use the discount code COMP315 for a 30% discount at checkout.


[1] Kappelman LA, McKeeman R, Zhang L. Early warning signs of IT project failure: the dominant dozen. Information Systems Management 23:31-36, 2006.

[2] Arquilla J. The Pentagon's biggest boondoggles. The New York Times (Opinion Pages) March 12, 2011.

[3] Lohr S. Lessons From Britain's Health Information Technology Fiasco. The New York Times Sept. 27, 2011.

[4] Dismantling the NHS national programme for IT. Department of Health Media Centre Press Release. September 22, 2011. Viewed June 12, 2012.

[5] Whittaker Z. UK's delayed national health IT programme officially scrapped. ZDNet September 22, 2011.

[6] Lohr S. Google to end health records service after it fails to attract users. The New York Times Jun 24, 2011.

[7] Schwartz E. Shopping for health software, some doctors get buyer's remorse. The Huffington Post Investigative Fund Jan 29, 2010.

[8] Heeks R, Mundy D, Salazar A. Why health care information systems succeed or fail. Institute for Development Policy and Management, University of Manchester, June 1999. Viewed July 12, 2012.

[9] Beizer B. Software Testing Techniques. Van Nostrand Reinhold, Hoboken, NJ, 2nd edition, 1990.

[10] Unreliable research: Trouble at the lab. The Economist October 19, 2013.

[11] Kolata G. Cancer fight: unclear tests for new drug. The New York Times April 19, 2010.

[12] Ioannidis JP. Why most published research findings are false. PLoS Med 2:e124, 2005.

[13] Baker M. Reproducibility crisis: Blame it on the antibodies. Nature 521:274-276, 2015.

[14] Naik G. Scientists' Elusive Goal: Reproducing Study Results. Wall Street Journal December 2, 2011.

[15] Innovation or Stagnation: Challenge and Opportunity on the Critical Path to New Medical Products. U.S. Department of Health and Human Services, Food and Drug Administration, 2004.

[16] Hurley D. Why Are So Few Blockbuster Drugs Invented Today? The New York Times November 13, 2014.

[17] Angell M. The Truth About the Drug Companies. The New York Review of Books Vol 51, July 15, 2004.

[18] Crossing the Quality Chasm: A New Health System for the 21st Century. Quality of Health Care in America Committee, editors. Institute of Medicine, Washington, DC., 2001.

[19] Wurtman RJ, Bettiker RL. The slowing of treatment discovery, 1965-1995. Nat Med 2:5-6, 1996.

[20] Ioannidis JP. Microarrays and molecular research: noise discovery? The Lancet 365:454-455, 2005.

[21] Weigelt B, Reis-Filho JS. Molecular profiling currently offers no more than tumour morphology and basic immunohistochemistry. Breast Cancer Research 12:S5, 2010.

[22] Personalised medicines: hopes and realities. The Royal Society, London, 2005. Viewed Jan 1, 2015.

[23] Vlasic B. Toyota's slow awakening to a deadly problem. The New York Times, February 1, 2010.

[24] Lanier J. The complexity ceiling. In: Brockman J, ed. The next fifty years: science in the first half of the twenty-first century. Vintage, New York, pp 216-229, 2002.

[25] Ecker JR, Bickmore WA, Barroso I, Pritchard JK, Gilad Y, Segal E. Genomics: ENCODE explained. Nature 489:52-55, 2012.

[26] Rosen JM, Jordan CT. The increasing complexity of the cancer stem cell paradigm. Science 324:1670-1673, 2009.

[27] Labos C. It Ain't Necessarily So: Why Much of the Medical Literature Is Wrong. Medscape News and Perspectives. September 9, 2014.

[28] Gilbert E, Strohminger N. We found only one-third of published psychology research is reliable - now what? The Conversation. August 27, 2015. Viewed August 27, 2015.

[29] Brooks FP. No silver bullet: essence and accidents of software engineering. Computer 20:10-19, 1987. Comment. This early (1987) paper tackles the problem of software complexity, emphasizing the importance of software design. Brooks also suggests that one of the best ways for institutions to achieve computer productivity is to provide staff with basic applications (e.g., word processor, spreadsheet, statistical packages) and train them in simple programming skills.
