Here is a preview of the contents:
TABLE OF CONTENTS

Chapter 0. Preface
    References for Preface
    Glossary for Preface

Chapter 1. The Simple Life
    Section 1.1. Simplification drives scientific progress
    Section 1.2. The human mind is a simplifying machine
    Section 1.3. Simplification in Nature
    Section 1.4. The Complexity Barrier
    Section 1.5. Getting ready
    Open Source Tools for Chapter 1
        Perl
        Python
        Ruby
        Text Editors
        OpenOffice
        Command line utilities
        Cygwin, Linux emulation for Windows
        DOS batch scripts
        Linux bash scripts
        Interactive line interpreters
        Package installers
        System calls
    References for Chapter 1
    Glossary for Chapter 1

Chapter 2. Structuring Text
    Section 2.1. The Meaninglessness of free text
    Section 2.2. Sorting text, the impossible dream
    Section 2.3. Sentence Parsing
    Section 2.4. Abbreviations
    Section 2.5. Annotation and the simple science of metadata
    Section 2.6. Specifications Good, Standards Bad
    Open Source Tools for Chapter 2
        ASCII
        Regular expressions
        Format commands
        Converting non-printable files to plain-text
        Dublin Core
    References for Chapter 2
    Glossary for Chapter 2

Chapter 3. Indexing Text
    Section 3.1. How Data Scientists Use Indexes
    Section 3.2. Concordances and Indexed Lists
    Section 3.3. Term Extraction and Simple Indexes
    Section 3.4. Autoencoding and Indexing with Nomenclatures
    Section 3.5. Computational Operations on Indexes
    Open Source Tools for Chapter 3
        Word lists
        Doublet lists
        Ngram lists
    References for Chapter 3
    Glossary for Chapter 3

Chapter 4. Understanding Your Data
    Section 4.1. Ranges and Outliers
    Section 4.2. Simple Statistical Descriptors
    Section 4.3. Retrieving Image Information
    Section 4.4. Data Profiling
    Section 4.5. Reducing data
    Open Source Tools for Chapter 4
        Gnuplot
        MatPlotLib
        R, for statistical programming
        Numpy
        Scipy
        ImageMagick
        Displaying equations in LaTex
        Normalized compression distance
        Pearson's correlation
        The ridiculously simple dot product
    References for Chapter 4
    Glossary for Chapter 4

Chapter 5. Identifying and Deidentifying Data
    Section 5.1. Unique Identifiers
    Section 5.2. Poor Identifiers, Horrific Consequences
    Section 5.3. Deidentifiers and Reidentifiers
    Section 5.4. Data Scrubbing
    Section 5.5. Data Encryption and Authentication
    Section 5.6. Timestamps, Signatures, and Event Identifiers
    Open Source Tools for Chapter 5
        Pseudorandom number generators
        UUID
        Encryption and decryption with OpenSSL
        One-way hash implementations
        Steganography
    References for Chapter 5
    Glossary for Chapter 5

Chapter 6. Giving Meaning to Data
    Section 6.1. Meaning and Triples
    Section 6.2. Driving Down Complexity with Classifications
    Section 6.3. Driving Up Complexity with Ontologies
    Section 6.4. The unreasonable effectiveness of classifications
    Section 6.5. Properties that Cross Multiple Classes
    Open Source Tools for Chapter 6
        Syntax for triples
        RDF Schema
        RDF parsers
        Visualizing class relationships
    References for Chapter 6
    Glossary for Chapter 6

Chapter 7. Object-oriented data
    Section 7.1. The Importance of Self-explaining Data
    Section 7.2. Introspection and Reflection
    Section 7.3. Object-Oriented Data Objects
    Section 7.4. Working with Object-Oriented Data
    Open Source Tools for Chapter 7
        Persistent data
        SQLite databases
    References for Chapter 7
    Glossary for Chapter 7

Chapter 8. Problem simplification
    Section 8.1. Random numbers
    Section 8.2. Monte Carlo Simulations
    Section 8.3. Resampling and Permutating
    Section 8.4. Verification, Validation, and Reanalysis
    Section 8.5. Data Permanence and Data Immutability
    Open Source Tools for Chapter 8
        Burrows Wheeler transform
        Winnowing and chaffing
    References for Chapter 8
    Glossary for Chapter 8

Over the next few weeks, I will be blogging on topics selected from Data Simplification: Taming Information With Open Source Tools. I hope I can convince you that this is a book worth reading.
- Jules Berman
key words: computer science, data analysis, data repurposing, data simplification, data wrangling, information science, simplifying data, taming data, jules j berman