My Big Data book
In yesterday's blog, I announced the publication of my new book, Principles of Big Data, 1st Edition: Preparing, Sharing, and Analyzing Complex Information. Here is a short essay describing some of the features that distinguish this Big Data book from all of the others.
The book describes:
- How to deal with complex data objects (unstructured text, categorical data, quantitative data, images, etc.), and how to extract small data sets (the kind you're probably familiar with) from Big Data resources.
- How to create Big Data resources in a legal, ethical and scientifically sensible manner.
- How to inspect and analyze Big Data.
- How to verify and validate the data and the conclusions drawn from the data.
- Identifiers and deidentification. Most people simply do not understand the importance of creating unique identifiers for data objects. In the book, I build an argument that Big Data resources are basically just a set of identifiers to which data is attached. Without proper identifiers there can be no useful analysis of Big Data. The book goes into some detail explaining how data objects can be identified and de-identified, and how data identifiers are crucial for merging data obtained from heterogeneous data sources.
- Metadata, Classification, and Introspection. Classifications drive down the complexity of Big Data, and metadata permits data objects to contain self-descriptive information about the data contained in the object, and the placement of the data object within the classification. The ability of data objects to provide information about themselves is called introspection. It is another property of serious Big Data resources that allows data from different resources to be shared and merged.
- Immutability. The data within Big Data resources must be immutable. This means that if a data object has a certain value at a certain time, then it will retain that value forever. To take an example from the medical field: if I have a glucose level of 85 on Tuesday, and a new measurement taken on Friday tells me that my glucose level is 105, then the level of 105 does not replace the earlier level. It merely creates a second value, with its own time-stamp and its own identifier, both values belonging to some data object (e.g., the data object that contains the lab tests of Jules Berman). In the book, I emphasize the practical importance of building immutability into Big Data resources, and explain how this might be achieved.
- Estimation. I happen to believe that almost every Big Data project should start off with a quick and dirty estimation. Most of the sophisticated analytic methods simply improve upon simple and intuitive data analysis methods. I devote several chapters to describing how data users should approach Big Data projects, how to assess the available data, and how to quickly estimate your results.
- Failures. It may come as a shock, but most Big Data efforts fail. Money, staff, and expertise cannot compensate for resources that overlook the fundamental properties of Big Data. In this book, I discuss and dissect some of the more spectacular failures, including several from the realm of Big Science.
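To make the identifier and immutability points above a little more concrete, here is a minimal Python sketch of the glucose example. It is only an illustration under my own assumptions, not code from the book: the class names (Measurement, DataObject), the use of random UUIDs as identifiers, and the calendar dates are all hypothetical. The idea is simply that a new measurement is appended with its own identifier and time-stamp, and no earlier value is ever overwritten.

```python
from dataclasses import dataclass
from datetime import datetime
import uuid


@dataclass(frozen=True)  # frozen: a measurement cannot be changed once created
class Measurement:
    measurement_id: str  # the measurement's own unique identifier
    name: str
    value: float
    timestamp: datetime


class DataObject:
    """A uniquely identified object to which measurements are only ever appended."""

    def __init__(self, label):
        self.object_id = str(uuid.uuid4())  # hypothetical choice: a random UUID
        self.label = label
        self._measurements = []

    def add(self, name, value, timestamp):
        # A new value never replaces an old one; it becomes an additional record.
        measurement = Measurement(str(uuid.uuid4()), name, value, timestamp)
        self._measurements.append(measurement)
        return measurement

    def history(self, name):
        # Every value ever recorded under this name, oldest first.
        matches = [m for m in self._measurements if m.name == name]
        return tuple(sorted(matches, key=lambda m: m.timestamp))


labs = DataObject("lab tests of Jules Berman")
labs.add("glucose", 85, datetime(2013, 6, 4))   # a Tuesday (date is hypothetical)
labs.add("glucose", 105, datetime(2013, 6, 7))  # the following Friday

glucose_values = [m.value for m in labs.history("glucose")]
print(glucose_values)  # [85, 105]
```

Both values survive, each with its own identifier and time-stamp, and an attempt to assign a new value to an existing Measurement raises an error. A real Big Data resource would need far more than this (persistent storage, identifier management across sources, metadata), but the append-only pattern is the core of the idea.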
Amazon has already released its Kindle version of the book.
tags: Big Data, Jules J. Berman, mutability, de-identification, troubleshooting Big Data