Here is a preview of the contents:
TABLE OF CONTENTS
Chapter 0. Preface
References for Preface
Glossary for Preface
Chapter 1. The Simple Life
Section 1.1. Simplification drives scientific progress
Section 1.2. The human mind is a simplifying machine
Section 1.3. Simplification in Nature
Section 1.4. The Complexity Barrier
Section 1.5. Getting ready
Open Source Tools for Chapter 1
Perl
Python
Ruby
Text Editors
OpenOffice
Command line utilities
Cygwin, Linux emulation for Windows
DOS batch scripts
Linux bash scripts
Interactive line interpreters
Package installers
System calls
References for Chapter 1
Glossary for Chapter 1
Chapter 2. Structuring Text
Section 2.1. The Meaninglessness of free text
Section 2.2. Sorting text, the impossible dream
Section 2.3. Sentence Parsing
Section 2.4. Abbreviations
Section 2.5. Annotation and the simple science of metadata
Section 2.6. Specifications Good, Standards Bad
Open Source Tools for Chapter 2
ASCII
Regular expressions
Format commands
Converting non-printable files to plain-text
Dublin Core
References for Chapter 2
Glossary for Chapter 2
Chapter 3. Indexing Text
Section 3.1. How Data Scientists Use Indexes
Section 3.2. Concordances and Indexed Lists
Section 3.3. Term Extraction and Simple Indexes
Section 3.4. Autoencoding and Indexing with Nomenclatures
Section 3.5. Computational Operations on Indexes
Open Source Tools for Chapter 3
Word lists
Doublet lists
Ngram lists
References for Chapter 3
Glossary for Chapter 3
Chapter 4. Understanding Your Data
Section 4.1. Ranges and Outliers
Section 4.2. Simple Statistical Descriptors
Section 4.3. Retrieving Image Information
Section 4.4. Data Profiling
Section 4.5. Reducing data
Open Source Tools for Chapter 4
Gnuplot
Matplotlib
R, for statistical programming
Numpy
Scipy
ImageMagick
Displaying equations in LaTeX
Normalized compression distance
Pearson's correlation
The ridiculously simple dot product
References for Chapter 4
Glossary for Chapter 4
Chapter 5. Identifying and Deidentifying Data
Section 5.1. Unique Identifiers
Section 5.2. Poor Identifiers, Horrific Consequences
Section 5.3. Deidentifiers and Reidentifiers
Section 5.4. Data Scrubbing
Section 5.5. Data Encryption and Authentication
Section 5.6. Timestamps, Signatures, and Event Identifiers
Open Source Tools for Chapter 5
Pseudorandom number generators
UUID
Encryption and decryption with OpenSSL
One-way hash implementations
Steganography
References for Chapter 5
Glossary for Chapter 5
Chapter 6. Giving Meaning to Data
Section 6.1. Meaning and Triples
Section 6.2. Driving Down Complexity with Classifications
Section 6.3. Driving Up Complexity with Ontologies
Section 6.4. The unreasonable effectiveness of classifications
Section 6.5. Properties that Cross Multiple Classes
Open Source Tools for Chapter 6
Syntax for triples
RDF Schema
RDF parsers
Visualizing class relationships
References for Chapter 6
Glossary for Chapter 6
Chapter 7. Object-oriented data
Section 7.1. The Importance of Self-explaining Data
Section 7.2. Introspection and Reflection
Section 7.3. Object-Oriented Data Objects
Section 7.4. Working with Object-Oriented Data
Open Source Tools for Chapter 7
Persistent data
SQLite databases
References for Chapter 7
Glossary for Chapter 7
Chapter 8. Problem simplification
Section 8.1. Random numbers
Section 8.2. Monte Carlo Simulations
Section 8.3. Resampling and Permutating
Section 8.4. Verification, Validation, and Reanalysis
Section 8.5. Data Permanence and Data Immutability
Open Source Tools for Chapter 8
Burrows-Wheeler transform
Winnowing and chaffing
References for Chapter 8
Glossary for Chapter 8
Over the next few weeks, I will be blogging on topics selected from Data Simplification: Taming Information With Open Source Tools. I hope I can convince you that this is a book worth reading.
- Jules Berman
key words: computer science, data analysis, data repurposing, data simplification, data wrangling, information science, simplifying data, taming data, jules j berman
