Those of you who are computer-oriented know that data analysis typically takes much less time and effort than data preparation. Moreover, if you make a mistake in your data analysis, you can often just repeat the process with different tools or a fresh approach to your original question. As long as the data is prepared properly, you and your colleagues can re-analyze it to your heart's content. Contrariwise, if your data is not prepared in a manner that supports sensible analysis, there is little you can do to extricate yourself from the situation. For this reason, data preparation is, in my experience, much more important than data analysis.
Throughout my career, I've relied on simple open source utilities and short scripts to simplify my data, producing products that were self-explanatory, permanent, and that could be merged with other types of data. Hence, my book.
Data Simplification: Taming Information With Open Source Tools
Publisher: Morgan Kaufmann; 1st edition (March 23, 2016)
ISBN-10: 0128037814
ISBN-13: 978-0128037812
Paperback: 398 pages
Dimensions: 7.5 x 9.2 inches
Chapter 1, The Simple Life, explores the thesis that complexity is the rate-limiting factor in human development. The greatest advances in human civilization and the most dramatic evolutionary improvements in all living organisms have followed the acquisition of methods that reduce or eliminate complexity.
Chapter 2, Structuring Text, reminds us that most of the data on the Web today is unstructured text, produced by individuals trying their best to communicate with one another. Data simplification often begins with textual data. This chapter provides readers with tools and strategies for imposing some basic structure on free text.
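As a taste of what such structuring can look like, here is a minimal Python sketch (my own illustration, not code from the book) that splits a paragraph of free text into numbered sentences using a deliberately naive rule; the sample text and the splitting pattern are invented for the example.

    import re

    # A toy example of imposing minimal structure on free text:
    # split a paragraph into sentences and tag each with a sequence number.
    free_text = ("Data analysis is easy. Data preparation is hard. "
                 "Dr. Smith disagrees, but most of us do not.")

    # Naive rule: a sentence ends at '.', '?', or '!' followed by whitespace
    # and an uppercase letter. An abbreviation such as "Dr." will fool it,
    # which is exactly why the chapter devotes a section to abbreviations.
    sentences = re.split(r'(?<=[.?!])\s+(?=[A-Z])', free_text)

    for number, sentence in enumerate(sentences, start=1):
        print(f"{number}\t{sentence}")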
Chapter 3, Indexing Text, describes the often undervalued benefits of indexes. An index, aided by proper annotation of data, permits us to understand data in ways that were not anticipated when the original content was collected. With the use of computers, multiple indexes, designed for different purposes, can be created for a single document or data set. As data accrues, indexes can be updated. When data sets are combined, their respective indexes can be merged. A good way of thinking about indexes is that the document contains all of the complexity; the index contains all of the simplicity. Data scientists who understand how to create and use indexes will be in the best position to search, retrieve, and analyze textual data. Methods are provided for automatically creating customized indexes designed for specific analytic pursuits and for binding index terms to standard nomenclatures.
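To make the idea concrete, the following Python snippet (a toy sketch of my own, not taken from the book) builds a simple index that maps each word to the line numbers on which it appears; the stop-word list and the tokenization rule are arbitrary illustrative choices.

    import re
    from collections import defaultdict

    STOP_WORDS = {"the", "a", "an", "of", "and", "to", "are", "can", "be"}

    def build_index(lines):
        """Map each lower-cased word to the sorted list of line numbers
        on which it occurs, skipping a few common stop words."""
        index = defaultdict(set)
        for line_number, line in enumerate(lines, start=1):
            for word in re.findall(r"[a-z]+", line.lower()):
                if word not in STOP_WORDS:
                    index[word].add(line_number)
        return {word: sorted(places) for word, places in sorted(index.items())}

    document = [
        "An index contains all of the simplicity.",
        "The document contains all of the complexity.",
        "Indexes can be merged when data sets are combined.",
    ]
    for term, places in build_index(document).items():
        print(term, places)

Because the index is just a dictionary, merging the indexes of two data sets amounts to taking the union of the line-number sets for each shared term.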
Chapter 4, Understanding Your Data, describes how data can be quickly assessed, prior to formal quantitative analysis, to develop some insight into what the data means. A few visualization tricks and simple statistical descriptors can greatly enhance a data scientist's understanding of complex and large data sets. Various types of data objects, such as text files, images, and time-series data, can be profiled with a summary signature that captures the key features contributing to the behavior and content of the data object. Such profiles can be used to find relationships among different data objects, or to determine when data objects are not closely related to one another.
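A minimal sketch of such a summary signature, using only Python's standard statistics module and invented measurement values, might look like this:

    import statistics

    def profile(values):
        """Return a small summary signature for a list of numbers:
        count, range, mean, median, and standard deviation."""
        return {
            "count": len(values),
            "minimum": min(values),
            "maximum": max(values),
            "mean": statistics.mean(values),
            "median": statistics.median(values),
            "stdev": statistics.stdev(values) if len(values) > 1 else 0.0,
        }

    # Two made-up measurement series; comparing their signatures gives a
    # quick sense of whether they plausibly describe the same process.
    series_a = [4.1, 4.3, 3.9, 4.0, 4.2, 4.4, 3.8]
    series_b = [4.1, 4.3, 3.9, 12.0, 4.2, 4.4, 3.8]  # contains one outlier

    print("series_a:", profile(series_a))
    print("series_b:", profile(series_b))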
Chapter 5, Identifying and Deidentifying Data, tackles one of the most under-appreciated and least understood issues in data science. Measurements, annotations, properties, and classes of information have no informational meaning unless they are attached to an identifier that distinguishes one data object from all other data objects, and that links together all of the information that has been or will be associated with the identified data object. The method of identification and the selection of objects and classes to be identified relate fundamentally to the organizational model of complex data. If the simplifying step of data identification is ignored or implemented improperly, data cannot be shared, and conclusions drawn from the data cannot be believed. All well-designed information systems are, at their heart, identification systems: ways of naming data objects so that they can be retrieved. Only well-identified data can be usefully deidentified. This chapter discusses methods for identifying data and for deidentifying data.
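As a rough illustration of the distinction (my own sketch, using Python's uuid and hashlib modules; the record fields and the salt are placeholders), a data object can be given a permanent unique identifier and later deidentified by replacing that identifier with a salted one-way hash:

    import hashlib
    import uuid

    # Assign a permanent, globally unique identifier to a new data object.
    record_id = str(uuid.uuid4())
    record = {"id": record_id, "name": "Jane Doe", "glucose_mg_dl": 95}

    # Deidentify by dropping the name and replacing the identifier with a
    # one-way hash of the identifier plus a secret salt. The salt below is a
    # placeholder; in practice it would be generated and guarded by the data
    # holder, which is also what makes controlled reidentification possible.
    SECRET_SALT = b"replace-with-a-guarded-secret"
    pseudonym = hashlib.sha256(SECRET_SALT + record_id.encode("utf-8")).hexdigest()

    deidentified = {"id": pseudonym, "glucose_mg_dl": record["glucose_mg_dl"]}
    print(deidentified)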
Chapter 6, Giving Meaning to Data, explores the meaning of meaning, as it applies to computer science. We shall learn that data, by itself, has no meaning. It is the job of the data scientist to assign meaning to data, and this is done with data objects, triples, and classifications (see Glossary items, Data object, Triple, Classification, Ontology). Unfortunately, coursework in the information sciences often omits discussion of the critical issue of "data meaning," advancing from data collection to data analysis without pausing to design data objects whose relationships to other data objects are defined and discoverable. In this chapter, readers will learn how to prepare and classify meaningful data.
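A minimal sketch of the triple idea, with identifiers and class names invented for the example, might look like this in Python:

    # Each statement about a data object is a (subject, predicate, object) triple.
    triples = [
        ("obj:29847575938125", "is_instance_of", "class:Patient"),
        ("obj:29847575938125", "has_glucose_mg_dl", "95"),
        ("class:Patient", "is_subclass_of", "class:HumanSubject"),
    ]

    def facts_about(subject, triples):
        """Collect every (predicate, object) pair asserted for a subject."""
        return [(p, o) for s, p, o in triples if s == subject]

    # Everything known about the identified data object, in one place.
    print(facts_about("obj:29847575938125", triples))

Because every assertion carries its own subject identifier, triples from different sources can be pooled without losing track of which statement belongs to which data object.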
Chapter 7, Object-Oriented Data, shows how we can understand data using a few elegant computational principles. Modern programming languages, particularly object-oriented programming languages, use introspective data (i.e., the data with which data objects describe themselves) to modify the execution of a program at run-time, an elegant process known as reflection. Using introspection and reflection, programs can integrate data objects with related data objects. The implementations of introspection, reflection, and integration are among the most important achievements in the field of computer science.
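Python itself makes the point concisely. In this illustrative sketch (my own, not an example from the book), a toy data object reports its own class and attributes at run-time, and an attribute is then retrieved by a name that is not known until run-time:

    class MeasuredSample:
        """A toy data object that can describe itself."""
        def __init__(self, identifier, value):
            self.identifier = identifier
            self.value = value

    sample = MeasuredSample("obj:1001", 42.0)

    # Introspection: the object reports its own class and attributes.
    print(type(sample).__name__)   # MeasuredSample
    print(vars(sample))            # {'identifier': 'obj:1001', 'value': 42.0}

    # Reflection: program behavior depends on what introspection reveals;
    # here an attribute is fetched by a name chosen at run-time.
    attribute_name = "value"
    print(getattr(sample, attribute_name))   # 42.0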
Chapter 8, Problem Simplification, demonstrates that it is just as important to simplify problems as it is to simplify data. This final chapter provides simple but powerful methods for analyzing data without resorting to advanced mathematical techniques. The use of random number generators to simulate the behavior of systems, and the application of Monte Carlo, resampling, and permutative methods to a wide variety of common problems in data analysis, will be discussed. The importance of data reanalysis, following preliminary analysis, is emphasized.
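As a small illustration of the resampling approach (my own sketch, with invented measurements, not an example from the book), a permutation test estimates how often a difference in group means as large as the observed one would arise by chance:

    import random

    random.seed(1)  # make the illustration reproducible

    # Invented measurements for two groups; the question is whether the
    # observed difference in means could plausibly be due to chance.
    group_a = [12.1, 11.8, 12.5, 12.9, 12.3, 12.7]
    group_b = [11.2, 11.5, 11.0, 11.9, 11.4, 11.6]
    observed = abs(sum(group_a) / len(group_a) - sum(group_b) / len(group_b))

    # Permutation test: repeatedly shuffle the pooled values, split them into
    # two groups of the original sizes, and count how often the shuffled
    # difference in means is at least as large as the observed difference.
    pooled = group_a + group_b
    trials = 10000
    extreme = 0
    for _ in range(trials):
        random.shuffle(pooled)
        a, b = pooled[:len(group_a)], pooled[len(group_a):]
        if abs(sum(a) / len(a) - sum(b) / len(b)) >= observed:
            extreme += 1

    print("observed difference:", round(observed, 2))
    print("estimated p-value:", extreme / trials)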
TABLE OF CONTENTS

Chapter 0. Preface
  References for Preface
  Glossary for Preface

Chapter 1. The Simple Life
  Section 1.1. Simplification drives scientific progress
  Section 1.2. The human mind is a simplifying machine
  Section 1.3. Simplification in Nature
  Section 1.4. The Complexity Barrier
  Section 1.5. Getting ready
  Open Source Tools for Chapter 1
    Perl
    Python
    Ruby
    Text Editors
    OpenOffice
    Command line utilities
    Cygwin, Linux emulation for Windows
    DOS batch scripts
    Linux bash scripts
    Interactive line interpreters
    Package installers
    System calls
  References for Chapter 1
  Glossary for Chapter 1

Chapter 2. Structuring Text
  Section 2.1. The Meaninglessness of free text
  Section 2.2. Sorting text, the impossible dream
  Section 2.3. Sentence Parsing
  Section 2.4. Abbreviations
  Section 2.5. Annotation and the simple science of metadata
  Section 2.6. Specifications Good, Standards Bad
  Open Source Tools for Chapter 2
    ASCII
    Regular expressions
    Format commands
    Converting non-printable files to plain-text
    Dublin Core
  References for Chapter 2
  Glossary for Chapter 2

Chapter 3. Indexing Text
  Section 3.1. How Data Scientists Use Indexes
  Section 3.2. Concordances and Indexed Lists
  Section 3.3. Term Extraction and Simple Indexes
  Section 3.4. Autoencoding and Indexing with Nomenclatures
  Section 3.5. Computational Operations on Indexes
  Open Source Tools for Chapter 3
    Word lists
    Doublet lists
    Ngram lists
  References for Chapter 3
  Glossary for Chapter 3

Chapter 4. Understanding Your Data
  Section 4.1. Ranges and Outliers
  Section 4.2. Simple Statistical Descriptors
  Section 4.3. Retrieving Image Information
  Section 4.4. Data Profiling
  Section 4.5. Reducing data
  Open Source Tools for Chapter 4
    Gnuplot
    MatPlotLib
    R, for statistical programming
    Numpy
    Scipy
    ImageMagick
    Displaying equations in LaTeX
    Normalized compression distance
    Pearson's correlation
    The ridiculously simple dot product
  References for Chapter 4
  Glossary for Chapter 4

Chapter 5. Identifying and Deidentifying Data
  Section 5.1. Unique Identifiers
  Section 5.2. Poor Identifiers, Horrific Consequences
  Section 5.3. Deidentifiers and Reidentifiers
  Section 5.4. Data Scrubbing
  Section 5.5. Data Encryption and Authentication
  Section 5.6. Timestamps, Signatures, and Event Identifiers
  Open Source Tools for Chapter 5
    Pseudorandom number generators
    UUID
    Encryption and decryption with OpenSSL
    One-way hash implementations
    Steganography
  References for Chapter 5
  Glossary for Chapter 5

Chapter 6. Giving Meaning to Data
  Section 6.1. Meaning and Triples
  Section 6.2. Driving Down Complexity with Classifications
  Section 6.3. Driving Up Complexity with Ontologies
  Section 6.4. The unreasonable effectiveness of classifications
  Section 6.5. Properties that Cross Multiple Classes
  Open Source Tools for Chapter 6
    Syntax for triples
    RDF Schema
    RDF parsers
    Visualizing class relationships
  References for Chapter 6
  Glossary for Chapter 6

Chapter 7. Object-Oriented Data
  Section 7.1. The Importance of Self-explaining Data
  Section 7.2. Introspection and Reflection
  Section 7.3. Object-Oriented Data Objects
  Section 7.4. Working with Object-Oriented Data
  Open Source Tools for Chapter 7
    Persistent data
    SQLite databases
  References for Chapter 7
  Glossary for Chapter 7

Chapter 8. Problem Simplification
  Section 8.1. Random numbers
  Section 8.2. Monte Carlo Simulations
  Section 8.3. Resampling and Permutating
  Section 8.4. Verification, Validation, and Reanalysis
  Section 8.5. Data Permanence and Data Immutability
  Open Source Tools for Chapter 8
    Burrows-Wheeler transform
    Winnowing and chaffing
  References for Chapter 8
  Glossary for Chapter 8

- Jules Berman
key words: computer science, data analysis, data repurposing, data simplification, data wrangling, information science, simplifying data, taming data, jules j berman