Thursday, May 30, 2013

Big Data Book Contents

I've taken a hiatus from the Specified Life blog to write my latest book, Principles of Big Data: Preparing, Sharing, and Analyzing Complex Information.



The Kindle edition is available now, and Amazon has a "look-inside" option on its book page. The print version will be available in a week or two, and Amazon is taking pre-orders. Here is the complete Table of Contents:
Acknowledgments xi
Author Biography xiii
Preface xv
Introduction xix

1. Providing Structure to Unstructured Data
  Background 1
  Machine Translation 2
  Autocoding 4
  Indexing 9
  Term Extraction 11

2. Identification, Deidentification, and Reidentification
  Background 15
  Features of an Identifier System 17
  Registered Unique Object Identifiers 18
  Really Bad Identifier Methods 22
  Embedding Information in an Identifier: Not Recommended 24
  One-Way Hashes 25
  Use Case: Hospital Registration 26
  Deidentification 28
  Data Scrubbing 30
  Reidentification 31
  Lessons Learned 32

3. Ontologies and Semantics
  Background 35
  Classifications, the Simplest of Ontologies 36
  Ontologies, Classes with Multiple Parents 39
  Choosing a Class Model 40
  Introduction to Resource Description Framework Schema 44
  Common Pitfalls in Ontology Development 46

4. Introspection
  Background 49
  Knowledge of Self 50
  eXtensible Markup Language 52
  Introduction to Meaning 54
  Namespaces and the Aggregation of Meaningful Assertions 55
  Resource Description Framework Triples 56
  Reflection 59
  Use Case: Trusted Time Stamp 59
  Summary 60

5. Data Integration and Software Interoperability
  Background 63
  The Committee to Survey Standards 64
  Standard Trajectory 65
  Specifications and Standards 69
  Versioning 71
  Compliance Issues 73
  Interfaces to Big Data Resources 74

6. Immutability and Immortality
  Background 77
  Immutability and Identifiers 78
  Data Objects 80
  Legacy Data 82
  Data Born from Data 83
  Reconciling Identifiers across Institutions 84
  Zero-Knowledge Reconciliation 86
  The Curator’s Burden 87

7. Measurement
  Background 89
  Counting 90
  Gene Counting 93
  Dealing with Negations 93
  Understanding Your Control 95
  Practical Significance of Measurements 96
  Obsessive-Compulsive Disorder: The Mark of a Great Data Manager 97

8. Simple but Powerful Big Data Techniques
  Background 99
  Look at the Data 100
  Data Range 110
  Denominator 112
  Frequency Distributions 115
  Mean and Standard Deviation 119
  Estimation-Only Analyses 122
  Use Case: Watching Data Trends with Google Ngrams 123
  Use Case: Estimating Movie Preferences 126

9. Analysis
  Background 129
  Analytic Tasks 130
  Clustering, Classifying, Recommending, and Modeling 130
  Data Reduction 134
  Normalizing and Adjusting Data 137
  Big Data Software: Speed and Scalability 139
  Find Relationships, Not Similarities 141

10. Special Considerations in Big Data Analysis
  Background 145
  Theory in Search of Data 146
  Data in Search of a Theory 146
  Overfitting 148
  Bigness Bias 148
  Too Much Data 151
  Fixing Data 152
  Data Subsets in Big Data: Neither Additive nor Transitive 153
  Additional Big Data Pitfalls 154

11. Stepwise Approach to Big Data Analysis
  Background 157
  Step 1. A Question Is Formulated 158
  Step 2. Resource Evaluation 158
  Step 3. A Question Is Reformulated 159
  Step 4. Query Output Adequacy 160
  Step 5. Data Description 161
  Step 6. Data Reduction 161
  Step 7. Algorithms Are Selected, If Absolutely Necessary 162
  Step 8. Results Are Reviewed and Conclusions Are Asserted 164
  Step 9. Conclusions Are Examined and Subjected to Validation 164

12. Failure
  Background 167
  Failure Is Common 168
  Failed Standards 169
  Complexity 172
  When Does Complexity Help? 173
  When Redundancy Fails 174
  Save Money; Don’t Protect Harmless Information 176
  After Failure 177
  Use Case: Cancer Biomedical Informatics Grid, a Bridge Too Far 178

13. Legalities
  Background 183
  Responsibility for the Accuracy and Legitimacy of Contained Data 184
  Rights to Create, Use, and Share the Resource 185
  Copyright and Patent Infringements Incurred by Using Standards 187
  Protections for Individuals 188
  Consent 190
  Unconsented Data 194
  Good Policies Are a Good Policy 197
  Use Case: The Havasupai Story 198

14. Societal Issues
  Background 201
  How Big Data Is Perceived 201
  The Necessity of Data Sharing, Even When It Seems Irrelevant 204
  Reducing Costs and Increasing Productivity with Big Data 208
  Public Mistrust 210
  Saving Us from Ourselves 211
  Hubris and Hyperbole 213

15. The Future
  Background 217
  Last Words 226

Glossary 229

References 247

Index 257

In the next few days, I'll be posting short excerpts from the book, along with commentary.

Best,
Jules Berman

key words: big data, heterogeneous data, complex datasets, Jules J. Berman, Ph.D., M.D., immutability, introspection, identifiers, de-identification, deidentification, confidentiality, privacy, massive data, lotsa data