Sunday, December 21, 2008


I've collected all of my recent blog posts on the CDC (Centers for Disease Control and Prevention) mortality data sets and put them into a single PDF file for easy reading.

All of the included techniques, scripts, tools and data sets are open source. They include:

1. Methods for accessing publicly available de-identified records collected from death certificates

2. Methods to parse and analyze the CDC data sets

3. Methods for compiling an ICD10 (International Classification of Diseases version 10) data dictionary from publicly available sources

4. Methods for creating map mashups from publicly available data sets

5. Script examples (mostly in Perl) of the kinds of questions that can be answered with the public-use data sets

The report, intended for biomedical researchers who have some programming knowledge (preferably Perl, Ruby or Python), is about 3 megabytes in size. It is available, at no cost, from:

-Jules J. Berman
My book, Principles of Big Data: Preparing, Sharing, and Analyzing Complex Information, was published in 2013 by Morgan Kaufmann.

I urge you to explore my book. Google Books has prepared a generous preview of the book's contents.

tags: big data, metadata, data preparation, data analytics, data repurposing, datamining, data mining, epidemiology