Friday, March 4, 2016

Data Simplification: Chapter Synopses

Over the next few weeks, I will be blogging on topics selected from Data Simplification: Taming Information With Open Source Tools.

Those of you who are computer-oriented know that data analysis typically takes much less time and effort than data preparation. Moreover, if you make a mistake in your data analysis, you can often just repeat the process, using different tools, or a fresh approach to your original question. As long as the data is prepared properly, you and your colleagues can re-analyze your data to your heart's content. Contrariwise, if your data is not prepared in a manner that supports sensible analysis, there's little you can do to extricate yourself from the situation. For this reason, data preparation is, in my experience, much more important than data analysis.

Throughout my career, I've relied on simple open source utilities and short scripts to simplify my data, producing data products that were self-explanatory, permanent, and mergeable with other types of data. Hence, my book.


Blog readers can use the discount code COMP315 for a 30% discount at checkout.


Data Simplification: Taming Information With Open Source Tools
Publisher: Morgan Kaufmann; 1st edition (March 23, 2016)
ISBN-10: 0128037814
ISBN-13: 978-0128037812
Paperback: 398 pages
Dimensions: 7.5 x 9.2 inches

Chapter 1, The Simple Life, explores the thesis that complexity is the rate-limiting factor in human development. The greatest advances in human civilization and the most dramatic evolutionary improvements in all living organisms have followed the acquisition of methods that reduce or eliminate complexity.

Chapter 2, Structuring Text, reminds us that most of the data on the Web today is unstructured text, produced by individuals, trying their best to communicate with one another. Data simplification often begins with textual data. This chapter provides readers with tools and strategies for imposing some basic structure on free-text.
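
As a concrete illustration of imposing basic structure on free text, here is a minimal Python sketch (my own, not code from the book) that splits a paragraph into sentences with a regular expression and emits each sentence as a numbered, tab-delimited record. The splitting rule is deliberately crude and is only an assumption about how sentences end.

import re

text = ("Data simplification often begins with text. "
        "Sentences can be pulled apart with a regular expression. "
        "Each sentence then becomes a small, structured record.")

# Crude sentence splitter: break after ., ?, or ! followed by whitespace.
# Real prose (abbreviations, quotations) will defeat this simple rule.
sentences = re.split(r'(?<=[.?!])\s+', text.strip())

# Emit each sentence as a tab-delimited record: number, word count, text.
for number, sentence in enumerate(sentences, start=1):
    print(f"{number}\t{len(sentence.split())}\t{sentence}")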

Chapter 3, Indexing Text, describes the often undervalued benefits of indexes. An index, aided by proper annotation of data, permits us to understand data in ways that were not anticipated when the original content was collected. With the use of computers, multiple indexes designed for different purposes can be created for a single document or data set. As data accrues, indexes can be updated. When data sets are combined, their respective indexes can be merged. A good way of thinking about indexes is that the document contains all of the complexity; the index contains all of the simplicity. Data scientists who understand how to create and use indexes will be in the best position to search, retrieve, and analyze textual data. Methods are provided for automatically creating customized indexes designed for specific analytic pursuits and for binding index terms to standard nomenclatures.
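
To show the basic idea of a computed index, here is a minimal Python sketch (not the book's own code) that maps each word in a small document to the line numbers on which it occurs. The three sample lines are invented for illustration.

from collections import defaultdict

lines = [
    "the index contains all of the simplicity",
    "the document contains all of the complexity",
    "indexes can be merged when data sets are combined",
]

# Map each lowercased word to the set of line numbers on which it occurs.
index = defaultdict(set)
for line_number, line in enumerate(lines, start=1):
    for word in line.lower().split():
        index[word].add(line_number)

# Print the index as alphabetized terms with their line numbers.
for term in sorted(index):
    print(term, sorted(index[term]))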

Chapter 4, Understanding Your Data, describes how data can be quickly assessed, prior to formal quantitative analysis, to develop some insight into what the data means. A few simple visualization tricks and simple statistical descriptors can greatly enhance a data scientist’s understanding of complex and large data sets. Various types of data objects, such as text files, images, and time-series data, can be profiled with a summary signature that captures the key features that contribute to the behavior and content of the data object. Such profiles can be used to find relationships among different data objects, or to determine when data objects are not closely related to one another.
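
As a small example of quick, informal data assessment, the following Python sketch (my own, using only the standard library's statistics module) computes a few simple descriptors for a numeric series; the sample values are invented.

import statistics

# A small illustrative series; in practice this would be read from a data file.
values = [12.1, 11.8, 12.4, 12.0, 35.9, 11.7, 12.2, 12.3]

profile = {
    "count":  len(values),
    "min":    min(values),
    "max":    max(values),
    "mean":   round(statistics.mean(values), 2),
    "median": statistics.median(values),
    "stdev":  round(statistics.stdev(values), 2),
}

for key, value in profile.items():
    print(f"{key:>7}: {value}")

# The wide gap between the mean and the median hints at the outlier (35.9)
# before any formal analysis is attempted.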

Chapter 5, Identifying and Deidentifying Data, tackles one of the most under-appreciated and least understood issues in data science. Measurements, annotations, properties, and classes of information have no informational meaning unless they are attached to an identifier that distinguishes one data object from all other data objects, and that links together all of the information that has been or will be associated with the identified data object. The method of identification and the selection of objects and classes to be identified relates fundamentally to the organizational model of complex data. If the simplifying step of data identification is ignored or implemented improperly, data cannot be shared, and conclusions drawn from the data cannot be believed. All well-designed information systems are, at their heart, identification systems: ways of naming data objects so that they can be retrieved. Only well-identified data can be usefully deidentified. This chapter discusses methods for identifying data and deidentifying data.
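
A minimal Python sketch (an illustration of the general idea, not the book's protocol) shows the two steps side by side: a unique identifier is attached to a data object, and a one-way hash then serves as a pseudonym for the identifying name, so that the record can be shared without revealing who it describes.

import uuid
import hashlib

# Assign a unique, permanent identifier to a data object.
record = {"id": str(uuid.uuid4()),
          "name": "John Q. Public",
          "glucose_mg_dl": 95}

# Deidentify: replace the name with a one-way hash (a pseudonym).
# The hash cannot be reversed, yet the same name always yields the same
# pseudonym, so records belonging to one individual can still be linked.
# In practice, a secret salt would be added to resist dictionary attacks.
pseudonym = hashlib.sha256(record["name"].encode("utf-8")).hexdigest()
deidentified = {"id": record["id"],
                "pseudonym": pseudonym,
                "glucose_mg_dl": record["glucose_mg_dl"]}

print(deidentified)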

Chapter 6, Giving Meaning to Data, explores the meaning of meaning, as it applies to computer science. We shall learn that data, by itself, has no meaning. It is the job of the data scientist to assign meaning to data, and this is done with data objects, triples, and classifications (see Glossary items, Data object, Triple, Classification, Ontology). Unfortunately, coursework in the information sciences often omits discussion of the critical issue of "data meaning," advancing from data collection to data analysis without stopping to design data objects whose relationships to other data objects are defined and discoverable. In this chapter, readers will learn how to prepare and classify meaningful data.
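
To make the triple idea concrete, here is a minimal Python sketch (my own illustration; the identifiers and class names are invented) in which every assertion is a triple of identified subject, predicate, and object, and the assertions about one data object can be gathered by its identifier.

# Each assertion is a triple: (identified subject, predicate, object).
triples = [
    ("object_9f2c", "is_instance_of", "Glucose_measurement"),
    ("object_9f2c", "has_value_mg_dl", "95"),
    ("object_9f2c", "was_measured_on", "2016-03-04"),
    ("Glucose_measurement", "is_subclass_of", "Laboratory_test"),
]

# Collect everything that has been asserted about one data object.
subject = "object_9f2c"
for s, predicate, obj in triples:
    if s == subject:
        print(subject, predicate, obj)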

Chapter 7, Object-Oriented Data, shows how we can understand data using a few elegant computational principles. Modern programming languages, particularly object-oriented programming languages, use introspective data (i.e., the data with which data objects describe themselves) to modify the execution of a program at run-time, an elegant process known as reflection. Using introspection and reflection, programs can integrate data objects with related data objects. The implementations of introspection, reflection, and integration are among the most important achievements in the field of computer science.
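
A minimal Python sketch (an illustration under my own invented class, not an example from the book) shows both ideas: introspection, in which a data object reports its own class and attributes, and reflection, in which the program chooses at run-time, by name, which attribute to act upon.

class Measurement:
    """A tiny data object that can describe itself."""
    def __init__(self, value, units):
        self.value = value
        self.units = units

sample = Measurement(98.6, "degrees_F")

# Introspection: the object reports its own class and attributes.
print(type(sample).__name__)   # -> Measurement
print(vars(sample))            # -> {'value': 98.6, 'units': 'degrees_F'}

# Reflection: the attribute to fetch is chosen at run-time, by name.
for attribute_name in vars(sample):
    print(attribute_name, "=", getattr(sample, attribute_name))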

Chapter 8, Problem Simplification, demonstrates that it is just as important to simplify problems as it is to simplify data. This final chapter provides simple but powerful methods for analyzing data, without resorting to advanced mathematical techniques. The use of random number generators to simulate the behavior of systems, and the application of Monte Carlo, resampling, and permutative methods to a wide variety of common problems in data analysis, will be discussed. The importance of data reanalysis, following preliminary analysis, is emphasized.
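
As a small taste of the Monte Carlo approach, here is a minimal Python sketch (the pi-estimation example is mine, not drawn from the book) that uses a pseudorandom number generator to simulate a system and extract a numerical answer without any advanced mathematics.

import random

random.seed(42)          # a fixed seed makes the simulation repeatable
trials = 100000
hits = 0

# Throw random points into the unit square; count those falling inside the
# quarter circle of radius 1. The hit fraction approximates pi/4.
for _ in range(trials):
    x, y = random.random(), random.random()
    if x * x + y * y <= 1.0:
        hits += 1

print("Monte Carlo estimate of pi:", 4 * hits / trials)
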
TABLE OF CONTENTS

Chapter 0. Preface
   References for Preface
   Glossary for Preface

Chapter 1. The Simple Life
   Section 1.1. Simplification drives scientific progress
   Section 1.2. The human mind is a simplifying machine
   Section 1.3. Simplification in Nature
   Section 1.4. The Complexity Barrier
   Section 1.5. Getting ready
   Open Source Tools for Chapter 1
      Perl
      Python
      Ruby
      Text Editors
      OpenOffice
      Command line utilities
      Cygwin, Linux emulation for Windows
      DOS batch scripts
      Linux bash scripts
      Interactive line interpreters
      Package installers
      System calls
   References for Chapter 1
   Glossary for Chapter 1

Chapter 2. Structuring Text
   Section 2.1. The Meaninglessness of free text
   Section 2.2. Sorting text, the impossible dream
   Section 2.3. Sentence Parsing
   Section 2.4. Abbreviations
   Section 2.5. Annotation and the simple science of metadata
   Section 2.6. Specifications Good, Standards Bad
   Open Source Tools for Chapter 2
      ASCII
      Regular expressions
      Format commands
      Converting non-printable files to plain-text
      Dublin Core
   References for Chapter 2
   Glossary for Chapter 2

Chapter 3. Indexing Text
   Section 3.1. How Data Scientists Use Indexes
   Section 3.2. Concordances and Indexed Lists
   Section 3.3. Term Extraction and Simple Indexes
   Section 3.4. Autoencoding and Indexing with Nomenclatures
   Section 3.5. Computational Operations on Indexes
   Open Source Tools for Chapter 3
      Word lists
      Doublet lists
      Ngram lists
   References for Chapter 3
   Glossary for Chapter 3

Chapter 4. Understanding Your Data
   Section 4.1. Ranges and Outliers
   Section 4.2. Simple Statistical Descriptors
   Section 4.3. Retrieving Image Information
   Section 4.4. Data Profiling
   Section 4.5. Reducing data
   Open Source Tools for Chapter 4
      Gnuplot
      MatPlotLib
      R, for statistical programming
      Numpy
      Scipy
      ImageMagick
      Displaying equations in LaTeX
      Normalized compression distance
      Pearson's correlation
      The ridiculously simple dot product
   References for Chapter 4 
   Glossary for Chapter 4

Chapter 5. Identifying and Deidentifying Data
   Section 5.1. Unique Identifiers
   Section 5.2. Poor Identifiers, Horrific Consequences
   Section 5.3. Deidentifiers and Reidentifiers
   Section 5.4. Data Scrubbing
   Section 5.5. Data Encryption and Authentication
   Section 5.6. Timestamps, Signatures, and Event Identifiers
   Open Source Tools for Chapter 5
      Pseudorandom number generators
      UUID
      Encryption and decryption with OpenSSL
      One-way hash implementations
      Steganography
   References for Chapter 5
   Glossary for Chapter 5

Chapter 6. Giving Meaning to Data
   Section 6.1. Meaning and Triples
   Section 6.2. Driving Down Complexity with Classifications
   Section 6.3. Driving Up Complexity with Ontologies
   Section 6.4. The unreasonable effectiveness of classifications
   Section 6.5. Properties that Cross Multiple Classes
   Open Source Tools for Chapter 6
      Syntax for triples
      RDF Schema
      RDF parsers
      Visualizing class relationships
   References for Chapter 6
   Glossary for Chapter 6

Chapter 7. Object-Oriented Data
   Section 7.1. The Importance of Self-explaining Data
   Section 7.2. Introspection and Reflection
   Section 7.3. Object-Oriented Data Objects
   Section 7.4. Working with Object-Oriented Data
   Open Source Tools for Chapter 7
      Persistent data
      SQLite databases
   References for Chapter 7
   Glossary for Chapter 7

Chapter 8. Problem Simplification
   Section 8.1. Random numbers
   Section 8.2. Monte Carlo Simulations
   Section 8.3. Resampling and Permutating
   Section 8.4. Verification, Validation, and Reanalysis
   Section 8.5. Data Permanence and Data Immutability
   Open Source Tools for Chapter 8
      Burrows-Wheeler transform
      Winnowing and chaffing
   References for Chapter 8
   Glossary for Chapter 8
- Jules Berman

key words: computer science, data analysis, data repurposing, data simplification, data wrangling, information science, simplifying data, taming data, jules j berman