Saturday, January 23, 2016

Validation versus Verification (in Data Science)

Validation is the process that checks whether the conclusions drawn from data analysis are correct. Validation usually starts with repeating the same analysis of the same data, using the methods that were originally recommended. Obviously, if a different set of conclusions is drawn from the same data and methods, the original conclusions cannot be validated. Validation may also involve applying a different set of analytic methods to the same data, to determine whether the conclusions are consistent. It is always reassuring to know that conclusions are repeatable with different analytic methods. In prior eras, experiments were validated by repeating the entire experiment, thus producing a new set of observations for analysis. Many of today’s scientific experiments are far too complex and costly to repeat. In such cases, validation requires access to the complete collection of the original data, and to the detailed protocols under which the data was generated. One of the most useful methods of validation involves testing new hypotheses, based on the assumed validity of the original conclusions. For example, if you were to accept Darwin’s analysis of barnacle data, leading to his theory of evolution, then you would expect to find a chronologic history of fossils in ascending layers of shale. This was the case; thus, paleontologists studying the Burgess Shale reserves provided some validation of Darwin’s conclusions. Validation should not be mistaken for proof. Nonetheless, the reproducibility of conclusions, over time, with the same or different sets of data, and the demonstration of consistency with related observations, is about all that we can hope for in this imperfect world.
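The first two validation steps described above (repeat the original analysis, then try a different method on the same data) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the data values, the function names, and the choice of comparing means versus medians are all hypothetical, and are not taken from any particular study.

```python
import statistics

def conclusion_by_means(group_a, group_b):
    """Hypothetical 'original' analysis: is group_b higher, judged by means?"""
    return statistics.mean(group_b) > statistics.mean(group_a)

def conclusion_by_medians(group_a, group_b):
    """Hypothetical alternative analysis: the same question, judged by medians."""
    return statistics.median(group_b) > statistics.median(group_a)

def validate(group_a, group_b, original_conclusion):
    """Check a published conclusion against the same data.

    Step 1: repeat the original method and see if the conclusion recurs.
    Step 2: apply a different method and see if the conclusion still holds.
    """
    repeated = conclusion_by_means(group_a, group_b)
    alternative = conclusion_by_medians(group_a, group_b)
    return {
        "repeatable": repeated == original_conclusion,
        "consistent_across_methods": alternative == original_conclusion,
    }

# Made-up measurements, for illustration only.
control = [4.1, 3.9, 4.3, 4.0, 4.2]
treated = [4.8, 5.1, 4.9, 5.3, 5.0]

# Suppose the original report concluded that the treated group is higher.
report = validate(control, treated, original_conclusion=True)
print(report)
```

Agreement on both checks does not prove the conclusion; it only shows that the conclusion survives repetition and a change of method on the same data.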

Verification is the process by which data is checked to determine whether the data was obtained properly (i.e., according to approved protocols), whether the data accurately measured what it was intended to measure, on the correct specimens, and whether all steps in data processing were completed in conformance with well-documented protocols. Verification often requires a level of expertise that is at least as high as the expertise of the individuals who produced the data [15]. Data verification requires a full understanding of the many steps involved in data collection and can be a time-consuming task. In one celebrated case, in which two statisticians reviewed a microarray study performed at Duke University, the time devoted to their verification effort was reported to be 2000 hours. To put this statement in perspective, the official work-year, according to the U.S. Office of Personnel Management, is 2087 hours, so the review consumed nearly a full work-year.
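By contrast with validation, verification looks only at the data and how it was produced. A minimal sketch, assuming a hypothetical protocol with an approved specimen list, a plausible measurement range, and a fixed sequence of processing steps (none of these names or values come from a real protocol), might look like this:

```python
from dataclasses import dataclass

# Hypothetical protocol settings, invented for this illustration.
APPROVED_SPECIMENS = {"S1", "S2", "S3"}
VALID_RANGE = (0.0, 100.0)
REQUIRED_STEPS = ("collected", "calibrated", "normalized")

@dataclass
class Record:
    specimen_id: str
    value: float
    steps_done: tuple

def verify(record):
    """Return a list of protocol problems found in one record (empty if none)."""
    problems = []
    # Was the measurement taken on a correct specimen?
    if record.specimen_id not in APPROVED_SPECIMENS:
        problems.append("unknown specimen")
    # Is the value something the measurement could plausibly produce?
    low, high = VALID_RANGE
    if not low <= record.value <= high:
        problems.append("value out of range")
    # Were all processing steps completed, in the documented order?
    if tuple(record.steps_done) != REQUIRED_STEPS:
        problems.append("processing steps deviate from protocol")
    return problems

records = [
    Record("S1", 42.0, ("collected", "calibrated", "normalized")),
    Record("S9", 42.0, ("collected", "calibrated", "normalized")),
    Record("S2", 140.0, ("collected", "normalized")),
]

results = {r.specimen_id: verify(r) for r in records}
for specimen_id, problems in results.items():
    print(specimen_id, problems or "ok")
```

Checks like these only catch mechanical deviations. The expert review described above, of the kind that took 2000 hours, involves judging whether the protocol itself was followed in substance, which no simple script can do.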

In short, verification is different from validation. Verification is performed on data; validation is done on the results of data analysis.

- Jules Berman (copyrighted material)

key words: data science, reproducibility, data integrity, verification, validation, jules j berman