Thursday, June 6, 2013

Condensed Principles of Big Data

Last night I re-read yesterday's post (Toward Big Data Immutability), and I realized that there really is no effective way to use this blog to teach anyone the mechanics of Big Data construction and analysis.  My guess is that many readers were confused by that post, because a single post cannot provide the back-story for the concepts it includes.

So, basically, I give up.  If you want to learn the fundamentals of Big Data, you'll need to do some reading.  I would recommend my own book, Principles of Big Data: Preparing, Sharing, and Analyzing Complex Information.  Depending on your background and agenda, you might prefer one of the hundreds of other books written for this vibrant field (I won't be offended).

The best I can do is to summarize, with a few principles, the basic theme of my book.

1. You cannot create a good Big Data resource without good identifiers.  A Big Data resource can be usefully envisioned as a system of identifiers to which data is attached.

2. Data must be described with metadata, and the metadata descriptors should be organized under a classification or an ontology.  Organizing the descriptors this way will drive down the complexity of the system and will permit heterogeneous data to be shared, merged, and queried across systems.

3. Big Data must be immutable.  You can add to Big Data, but you can never alter or delete the contained data.

4. Big Data must be accessible to the public if it is to have any scientific value.  Unless members of the public have a chance to verify, validate, and examine your data, the conclusions drawn from the data have almost no scientific credibility.

5. Data analysis is important, but data re-analysis is much more important.  There are many ways to analyze data, and it's hard to know when your conclusions are correct.  If principles 1 through 4 are followed, the data can be re-examined at a later time.  If you can re-analyze data, then the original analysis is not so critical. Sometimes, a re-analysis that occurs years or decades after the original report, fortified with new data obtained in the interim, can have enormous value and consequence.
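
For readers who think better in code, here is one way principles 1 through 3 might be sketched in Python. This is only an illustration under my own assumptions, not the method from the book: the class and method names (Assertion, AppendOnlyStore, new_identifier, add, history, current) are hypothetical, the metadata tags are plain strings rather than terms drawn from a real classification or ontology, and a random UUID stands in for a proper identifier system.

```python
import uuid
from datetime import datetime, timezone
from typing import List, NamedTuple, Optional


class Assertion(NamedTuple):
    """One statement about an identified object. Tuples cannot be modified."""
    identifier: str   # principle 1: every datum hangs off an identifier
    metadata: str     # principle 2: a descriptor saying what the value means
    value: str
    recorded_at: str  # timestamp, so later additions can be told from earlier ones


class AppendOnlyStore:
    """Hypothetical store: data can be added, never altered or deleted."""

    def __init__(self) -> None:
        self._log: List[Assertion] = []

    def new_identifier(self) -> str:
        # Assumption: a random UUID is unique enough for this sketch.
        return str(uuid.uuid4())

    def add(self, identifier: str, metadata: str, value: str) -> Assertion:
        # Principle 3: the only write operation is an append.
        stamp = datetime.now(timezone.utc).isoformat()
        assertion = Assertion(identifier, metadata, value, stamp)
        self._log.append(assertion)
        return assertion

    def history(self, identifier: str, metadata: str) -> List[Assertion]:
        # Returns a new list, in the order recorded; the log itself is not exposed.
        return [a for a in self._log
                if a.identifier == identifier and a.metadata == metadata]

    def current(self, identifier: str, metadata: str) -> Optional[str]:
        # The most recently added value, or None if nothing was recorded.
        past = self.history(identifier, metadata)
        return past[-1].value if past else None


store = AppendOnlyStore()
person = store.new_identifier()
store.add(person, "surname", "Smith")
store.add(person, "surname", "Jones")  # a later value is appended, not overwritten
print(store.current(person, "surname"))
print([a.value for a in store.history(person, "surname")])
```

The point of the design is that a "change" is just a newer assertion attached to the same identifier. The earlier value stays in the record, so anyone who re-analyzes the data later (principle 5) can see what was known at the time of the original analysis.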

- Jules Berman

key words: identification, identifier system, mutable, mutability, immutability, big data analysis, repeated analysis, Big Data analyses, analytic techniques for Big Data, scientific validity of Big Data, open access Big Data, public access Big Data, Big Data concepts, Jules J. Berman, Ph.D., M.D.