Saturday, January 30, 2016


"The first lesson of Web-scale learning is to use available large-scale data rather than hoping for annotated data that isn't available." -Peter Norvig, Alon Halevy, and Fernando Pereira [1]

Despite the preponderance of old data, most data scientists devote their efforts to newly acquired data, or to nonexistent data that may emerge in the unknowable future. Why does old data get so little respect? The reasons are manifold.

1. Much of old data is proprietary and cannot be accessed by anyone other than its owners.

2. The owners of proprietary data, in many cases, are barely aware of the contents, or even the existence, of their own data, and have no understanding of the value of their holdings, to themselves or to others.

3. Old data is typically stored in formats that are inscrutable to young data scientists. The technical expertise required to render the data intelligible is unavailable.

4. Much of old data lacks proper annotation. There simply is not sufficient information about the data (e.g., how it was collected and what the data means) to support useful analysis.

5. Much of old data, annotated or not, has not been indexed in any serious way. There is no easy method of searching the contents of old data.

6. Much of old data is poor data, collected without the kinds of quality assurances that would be required to support any useful analysis of its contents.

7. Old data is orphaned data. When data has no guardianship, the tendency is to ignore the data or to greatly underestimate its value.

The sheer messiness of old data is conveyed by the gritty jargon that permeates the field of data repurposing (data cleaning, data mining, data munging, data scraping, data scrubbing, data wrangling). Anything that requires munging, scraping, and scrubbing can't be too clean.
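To make the jargon concrete, here is a minimal sketch of the sort of scrubbing and munging that legacy records typically demand. The field names, date formats, and records are hypothetical, invented purely for illustration; real legacy archives are messier still.

```python
# Hypothetical legacy records: inconsistent capitalization, stray
# whitespace, mixed date formats, and missing values.
from datetime import datetime

def scrub_record(raw):
    """Normalize one legacy record into a clean dict, or return None."""
    name = raw.get("name", "").strip().title()
    if not name:
        return None  # unusable record: no identifying field
    # Legacy archives often mix date formats; try each known format.
    date = None
    for fmt in ("%Y-%m-%d", "%m/%d/%Y", "%d %b %Y"):
        try:
            date = datetime.strptime(raw.get("date", "").strip(), fmt)
            break
        except ValueError:
            continue
    return {"name": name, "date": date.date().isoformat() if date else None}

legacy = [
    {"name": "  smith, j ", "date": "03/15/1978"},
    {"name": "DOE, A", "date": "1982-07-04"},
    {"name": "", "date": "11/11/1911"},  # dropped: no name field
]
clean = [r for r in (scrub_record(rec) for rec in legacy) if r]
```

The point of the sketch is not the particular cleanup rules, which vary from archive to archive, but that each rule encodes knowledge about how the old data was recorded; without that knowledge, the scrubbing cannot be done.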

Data sources are referred to as "old" or "legacy"; neither term calls to mind vitality or robustness. A helpful way of thinking about the subject is to recognize that new data is just updated old data. New data (see Glossary item, New data, below), without old data, cannot be used to see long-term trends.
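The long-term trend argument can be illustrated with a toy calculation. The yearly measurements below are hypothetical, invented for this sketch; the point is only that a trend fitted to a few recent "new" points can differ from the trend visible when the archived "old" points are included.

```python
# Hypothetical yearly measurements: "old" archived values plus a few
# recently acquired "new" values.

def slope(points):
    """Ordinary least-squares slope of (x, y) pairs."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    num = sum((x - mean_x) * (y - mean_y) for x, y in points)
    den = sum((x - mean_x) ** 2 for x, _ in points)
    return num / den

old_data = [(1990, 10.0), (1995, 10.5), (2000, 11.0), (2005, 11.5)]
new_data = [(2010, 12.1), (2011, 12.0), (2012, 12.2)]

long_term = slope(old_data + new_data)  # trend over two decades
short_term = slope(new_data)            # trend over three recent points
```

With these invented numbers, the three recent points alone understate the rate of change that the full record reveals; discard the old data, and the long-term trend is simply unknowable.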

Nobody seems to put enough value on legacy data. Nobody seems to want to pay for legacy data and nobody seems to invest in preserving legacy data. The stalwart data scientist must not be discouraged. As I'll show in future blogs, preserving old data is definitely worth the bother.


New data - It is natural to think of certain objects as being "new", meaning, with no prior existence; and of other objects as being "old", having persisted from an earlier time into the present. In truth, there are very few "new" objects in our universe. Most objects arise in a continuum, through a transformation or a modification of an old object. For example, embryos are simply cellular growths that develop from pre-existing gonocytes, and the development of an embryo into a newborn organism, which is not really new at all, follows an ancient path written by combined fragments of pre-existing DNA sequences. When we speak of "new" data, alternatively known as prospectively acquired data or as prospective data, we must think in terms that relate the new data to the "old" data that preceded it. For example, the air temperature one minute from now is largely determined by weather events that are occurring now, and the weather occurring now is largely determined by all of the weather events that have occurred in the history of our planet. Data scientists have a pithy aphorism that captures the entangled relationship between "new" and "old" data: "Every prospective study becomes a retrospective study on day two".


[1] Halevy A, Norvig P, Pereira F. The unreasonable effectiveness of data. IEEE Intelligent Systems 24:8-12, 2009.

- Jules Berman (copyrighted material)

key words: data repurposing, data science, using data, data simplification, legacy data, jules j berman