Friday, May 31, 2013

Big Data Book Explained


My Big Data book

 In yesterday's blog, I announced the publication of my new book, Principles of Big Data, 1st Edition: Preparing, Sharing, and Analyzing Complex Information.  Here is a short essay describing some of the features that distinguish this Big Data book from all of the others.

The book describes:
  • How to deal with complex data objects (unstructured text, categorical data, quantitative data, images, etc.), and how to extract small data sets (the kind you're probably familiar with) from Big Data resources.
  • How to create Big Data resources in a legal, ethical and scientifically sensible manner.
  • How to inspect and analyze Big Data. 
  • How to verify and validate the data and the conclusions drawn from the data.
The book expands upon several subjects that are omitted from most Big Data books.
  • Identifiers and deidentification. Most people simply do not understand the importance of creating unique identifiers for data objects. In the book, I build an argument that a Big Data resource is basically just a set of identifiers to which data is attached. Without proper identifiers, there can be no useful analysis of Big Data. The book goes into some detail explaining how data objects can be identified and de-identified, and how data identifiers are crucial for merging data obtained from heterogeneous data sources (a short sketch follows this list).
  • Metadata, Classification, and Introspection. Classifications drive down the complexity of Big Data, and metadata permits a data object to carry self-descriptive information about the data it contains and about its placement within the classification. The ability of data objects to provide information about themselves is called introspection. It is another property of serious Big Data resources that allows data from different resources to be shared and merged (see the second sketch after this list).
  • Immutability. The data within Big Data resources must be immutable. This means that if a data object has a certain value at a certain time, then it will retain that value forever. To take an example from the medical field: if I have a glucose level of 85 on Tuesday, and a new measurement taken on Friday tells me that my glucose level is 105, then the value of 105 does not replace the earlier value. It merely creates a second value, with its own time-stamp and its own identifier, both belonging to some data object (e.g., the data object that contains the lab tests of Jules Berman). In the book, I emphasize the practical importance of building immutability into Big Data resources, and how this might be achieved (the third sketch below illustrates the idea).
  • Estimation. I happen to believe that almost every Big Data project should start off with a quick and dirty estimation. Most sophisticated analytic methods simply improve upon simple and intuitive data analysis methods. I devote several chapters to describing how data users should approach Big Data projects, how to assess the available data, and how to quickly estimate your results (the last sketch below shows the idea).
  • Failures. It may come as a shock, but most Big Data efforts fail. Money, staff, and expertise cannot compensate for resources that overlook the fundamental properties of Big Data. In this book, I discuss and dissect some of the more spectacular failures, including several from the realm of Big Science.
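To make the identifier argument concrete, here is a minimal Python sketch. It is my own illustration, not code from the book; the record fields and the salt are hypothetical:

    import hashlib
    import uuid

    # Mint a permanent, unique identifier for a new data object.
    # uuid4() returns a random 128-bit identifier, so independent
    # resources can mint identifiers without coordinating, and their
    # records can still be merged later without collisions.
    record = {
        "id": str(uuid.uuid4()),
        "name": "John Q. Public",   # identifying field (hypothetical)
        "glucose_mg_dl": 85,        # data attached to the identifier
    }

    # De-identify: replace the identifying field with a one-way hash.
    # The digest cannot be reversed, but the same name always yields
    # the same digest, so records from one person can still be linked.
    salt = "resource-specific-secret"  # hypothetical salt; keep it private
    digest = hashlib.sha256((salt + record["name"]).encode()).hexdigest()

    deidentified = {
        "id": record["id"],
        "name_hash": digest,
        "glucose_mg_dl": record["glucose_mg_dl"],
    }
    print(deidentified)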
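The second sketch shows introspection: a data object that carries, besides its data, metadata naming its class and that class's ancestry in a classification. The class names are hypothetical, and the book develops this idea with XML and RDF rather than Python dictionaries:

    # A self-describing data object: any program that receives it
    # can ask the object what it is and where it sits in the
    # classification, without consulting an external schema.
    data_object = {
        "id": "b2c4e6a8-0000-0000-0000-000000000000",  # unique identifier
        "class": "SerumGlucoseMeasurement",            # hypothetical class
        "ancestors": ["ClinicalObservation", "LaboratoryTest"],
        "value": 85,
        "units": "mg/dL",
    }

    def introspect(obj):
        """Report what the object says about itself."""
        lineage = " -> ".join(obj["ancestors"] + [obj["class"]])
        return f"{obj['id']}: {obj['value']} {obj['units']} ({lineage})"

    print(introspect(data_object))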
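The third sketch renders the glucose example as an append-only store; the identifier scheme and the dates are illustrative only:

    import uuid
    from datetime import date

    # Append-only store: new measurements never replace old ones.
    measurements = []

    def add_measurement(glucose_mg_dl, when):
        """Record a value with its own identifier and time stamp."""
        measurements.append({
            "id": str(uuid.uuid4()),
            "glucose_mg_dl": glucose_mg_dl,
            "date": when.isoformat(),
        })

    add_measurement(85, date(2013, 5, 28))   # Tuesday's value
    add_measurement(105, date(2013, 5, 31))  # Friday's value is added,
                                             # not substituted

    # Both values survive; the history of the data object is intact.
    for m in measurements:
        print(m)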
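Finally, a toy example of quick and dirty estimation: draw a small random sample and estimate the quantity of interest before committing to a full-scale analysis. The data here are stand-ins:

    import random

    random.seed(0)

    # Stand-in for a large value set drawn from a Big Data resource.
    population = [random.gauss(100, 15) for _ in range(1_000_000)]

    # Estimate the mean from 1,000 records instead of 1,000,000.
    sample = random.sample(population, 1_000)
    estimate = sum(sample) / len(sample)
    print(f"estimated mean from the sample: {estimate:.1f}")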
A full Table of Contents for Principles of Big Data, 1st Edition: Preparing, Sharing, and Analyzing Complex Information is found at my web site. You can purchase the book from the Elsevier site. At the site, the listed publication date is June 15, but my editor assures me that the warehouse has copies of the book, and that the official publication date has been moved to June 4. If you have any trouble placing the order, you can wait until June 4 and try again.

Amazon has already released its Kindle version of the book.

Jules Berman

tags: Big Data, Jules J. Berman, mutability, de-identification, troubleshooting Big Data

Thursday, May 30, 2013

Big Data Book Contents

I've taken a hiatus from the Specified Life blog while I wrote my latest book, entitled Principles of Big Data: Preparing, Sharing, and Analyzing Complex Information.



The Kindle edition is available now, and Amazon has a "look-inside" option on their book page. The print version will be available in a week or two, and Amazon is taking pre-orders. Here is the complete Table of Contents:
Acknowledgments xi
Author Biography xiii
Preface xv
Introduction xix

1. Providing Structure to Unstructured Data
  Background 1
  Machine Translation 2
  Autocoding 4
  Indexing 9
  Term Extraction 11

2. Identification, Deidentification, and Reidentification
  Background 15
  Features of an Identifier System 17
  Registered Unique Object Identifiers 18
  Really Bad Identifier Methods 22
  Embedding Information in an Identifier: Not Recommended 24
  One-Way Hashes 25
  Use Case: Hospital Registration 26
  Deidentification 28
  Data Scrubbing 30
  Reidentification 31
  Lessons Learned 32

3. Ontologies and Semantics
  Background 35
  Classifications, the Simplest of Ontologies 36
  Ontologies, Classes with Multiple Parents 39
  Choosing a Class Model 40
  Introduction to Resource Description Framework Schema 44
  Common Pitfalls in Ontology Development 46

4. Introspection
  Background 49
  Knowledge of Self 50
  eXtensible Markup Language 52
  Introduction to Meaning 54
  Namespaces and the Aggregation of Meaningful Assertions 55
  Resource Description Framework Triples 56
  Reflection 59
  Use Case: Trusted Time Stamp 59
  Summary 60

5. Data Integration and Software Interoperability
  Background 63
  The Committee to Survey Standards 64
  Standard Trajectory 65
  Specifications and Standards 69
  Versioning 71
  Compliance Issues 73
  Interfaces to Big Data Resources 74

6. Immutability and Immortality
  Background 77
  Immutability and Identifiers 78
  Data Objects 80
  Legacy Data 82
  Data Born from Data 83
  Reconciling Identifiers across Institutions 84
  Zero-Knowledge Reconciliation 86
  The Curator’s Burden 87

7. Measurement
  Background 89
  Counting 90
  Gene Counting 93
  Dealing with Negations 93
  Understanding Your Control 95
  Practical Significance of Measurements 96
  Obsessive-Compulsive Disorder: The Mark of a Great Data Manager 97

8. Simple but Powerful Big Data Techniques
  Background 99
  Look at the Data 100
  Data Range 110
  Denominator 112
  Frequency Distributions 115
  Mean and Standard Deviation 119
  Estimation-Only Analyses 122
  Use Case: Watching Data Trends with Google Ngrams 123
  Use Case: Estimating Movie Preferences 126

9. Analysis
  Background 129
  Analytic Tasks 130
  Clustering, Classifying, Recommending, and Modeling 130
  Data Reduction 134
  Normalizing and Adjusting Data 137
  Big Data Software: Speed and Scalability 139
  Find Relationships, Not Similarities 141

10. Special Considerations in Big Data Analysis
  Background 145
  Theory in Search of Data 146
  Data in Search of a Theory 146
  Overfitting 148
  Bigness Bias 148
  Too Much Data 151
  Fixing Data 152
  Data Subsets in Big Data: Neither Additive nor Transitive 153
  Additional Big Data Pitfalls 154

11. Stepwise Approach to Big Data Analysis
  Background 157
  Step 1. A Question Is Formulated 158
  Step 2. Resource Evaluation 158
  Step 3. A Question Is Reformulated 159
  Step 4. Query Output Adequacy 160
  Step 5. Data Description 161
  Step 6. Data Reduction 161
  Step 7. Algorithms Are Selected, If Absolutely Necessary 162
  Step 8. Results Are Reviewed and Conclusions Are Asserted 164
  Step 9. Conclusions Are Examined and Subjected to Validation 164

12. Failure
  Background 167
  Failure Is Common 168
  Failed Standards 169
  Complexity 172
  When Does Complexity Help? 173
  When Redundancy Fails 174
  Save Money; Don’t Protect Harmless Information 176
  After Failure 177
  Use Case: Cancer Biomedical Informatics Grid, a Bridge Too Far 178

13. Legalities
  Background 183
  Responsibility for the Accuracy and Legitimacy of Contained Data 184
  Rights to Create, Use, and Share the Resource 185
  Copyright and Patent Infringements Incurred by Using Standards 187
  Protections for Individuals 188
  Consent 190
  Unconsented Data 194
  Good Policies Are a Good Policy 197
  Use Case: The Havasupai Story 198

14. Societal Issues
  Background 201
  How Big Data Is Perceived 201
  The Necessity of Data Sharing, Even When It Seems Irrelevant 204
  Reducing Costs and Increasing Productivity with Big Data 208
  Public Mistrust 210
  Saving Us from Ourselves 211
  Hubris and Hyperbole 213

15. The Future
  Background 217
  Last Words 226

Glossary 229

References 247

Index 257

In the next few days, I'll be posting short excerpts from the book, along with commentary.

Best,
Jules Berman

key words: big data, heterogeneous data, complex datasets, Jules J. Berman, Ph.D., M.D., immutability, introspection, identifiers, de-identification, deidentification, confidentiality, privacy, massive data, lotsa data