Tuesday, December 2, 2008

CDC Mortality data: 1

I thought I would do a new set of blog posts devoted to the CDC (U.S. Centers for Disease Control and Prevention) mortality data.

This amazing data set is more than 1 gigabyte in size and contains individual de-identified (and sometimes ambiguated) records for U.S. deaths occurring in 1999.

With access to this file, an informaticist/epidemiologist can glean a wealth of information related to the immediate, underlying, and contributing causes of death in the U.S.

Here are just some of the incredible features of the data set:

Because each record contains multiple conditions related to the death of the individual, or present in the individual at the time of death, it is possible to draw inferences about the relationships among different conditions and the likelihood that conditions co-exist.

Because demographic information is provided, it is possible to determine the frequency of occurrence of conditions in age groups, ethnic groups, localities, and genders.

Because the organization of records meticulously preserves the order and organization of the original death certificate, it is possible to relate conditions by their order of causation (which conditions lead to which other conditions).

Because the disease conditions are coded with an international standard, the International Classification of Diseases, 10th revision (ICD10), all of the disease entries can be understood and correlated with terms from any other data set coded with the same nomenclature.

Because there are over 2.3 million records in the dataset, it is possible to find significant numbers of cases for hundreds of different conditions.

Because every record conforms to a consistent organization, it is possible to re-organize and merge these records with data from other sources, increasing the value of the original data.

The key file that we will be using is available by anonymous ftp from the CDC server:

ftp.cdc.gov
/pub/Health_Statistics/NCHS/Datasets/mortality
Mort99us.zip (88,077,536 bytes)

This file unzips to:
Mort99us.dat (1,058,532,982 bytes)
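
If you would rather script the download than point a browser at the server, a minimal sketch using Net::FTP (bundled with core Perl) might look like this. The e-mail address supplied as the anonymous password is just a placeholder:

use strict;
use warnings;
use Net::FTP;

# Connect to the CDC's anonymous ftp server (passive mode
# gets through most firewalls)
my $ftp = Net::FTP->new('ftp.cdc.gov', Passive => 1)
  or die "Cannot connect to ftp.cdc.gov: $@";
$ftp->login('anonymous', 'anonymous@example.com')  # placeholder e-mail
  or die "Login failed: ", $ftp->message;
$ftp->cwd('/pub/Health_Statistics/NCHS/Datasets/mortality')
  or die "Cannot reach directory: ", $ftp->message;
$ftp->binary;  # zip archives must be transferred in binary mode
$ftp->get('Mort99us.zip')
  or die "Download failed: ", $ftp->message;
$ftp->quit;

Any standard zip utility will then extract Mort99us.dat from the archive.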

Each record looks something like this:

0 11019993630101999999913630103299115401 10111073402009 6 1010075 990999 99999 199901015010150450 009 7 J449267000860622800511J969 12J449 61E109 62I709 63I500 03 C259 E149 I10

Note that this is a composite record, assembled from string sequences taken from several different records. I didn't think it would be necessary or appropriate for me to publish an actual record from the CDC file.
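
Every record is a single fixed-width line, with each field occupying a documented column position. As a sketch of the general approach (the offsets below are hypothetical placeholders; the true positions come from the NCHS record layout documentation that accompanies the data set), individual fields can be pulled out with Perl's substr function:

use strict;
use warnings;

open my $fh, '<', 'Mort99us.dat' or die "Cannot open Mort99us.dat: $!";
while (my $record = <$fh>) {
    # CAUTION: hypothetical offsets, for illustration only.
    # Consult the NCHS record layout for the true column positions.
    my $sex        = substr($record, 58, 1);
    my $age        = substr($record, 64, 3);
    my $underlying = substr($record, 141, 4);  # underlying cause of death
    print "$sex\t$age\t$underlying\n";
    last if $. >= 10;  # peek at the first ten records only
}
close $fh;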

Looking at the record, a seeming jumble of alphanumerics and spaces, you might conclude that extracting any useful information would be a formidable task, well beyond the capacity of non-specialists. Actually, all of the data in the 1+ gigabyte file can be parsed, re-assembled, and analyzed in a matter of seconds, with a few lines of code that anyone can understand and implement.
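
As a first taste, here is one way to tally every string in the file that is shaped like an ICD10 code (an uppercase letter followed by two or three digits). This is a pattern-matching shortcut, not the official fixed-column method, but it shows how little code the job demands:

use strict;
use warnings;

# Stream the 1+ gigabyte file one record at a time (never loading
# the whole file into memory) and tally every ICD10-shaped token
my %tally;
open my $fh, '<', 'Mort99us.dat' or die "Cannot open Mort99us.dat: $!";
while (my $record = <$fh>) {
    $tally{$_}++ for $record =~ /([A-Z]\d{2,3})/g;
}
close $fh;

# Report the twenty most frequent codes
my @ranked = sort { $tally{$b} <=> $tally{$a} } keys %tally;
splice @ranked, 20 if @ranked > 20;
printf "%s\t%d\n", $_, $tally{$_} for @ranked;

The whole analysis is a single sequential pass through the file, which is what makes it fast.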

In the next week or so, I will show you, step-by-step, how to master the CDC mortality file, using free, open source utilities and a few lines of Perl.

The process is no different from baking a cake from a recipe: you assemble your ingredients, follow a series of steps, wait a few moments for the cake to bake, and enjoy the results. In the next blog, I'll show you the ingredients that you must assemble.

As I remind readers in almost every blog post, if you want to do your own creative data mining, you will need to learn a little computer programming.

For Perl and Ruby programmers, methods and scripts for using SEER and other publicly available biomedical databases are described in detail in my prior books:

Perl Programming for Medicine and Biology

Ruby Programming for Medicine and Biology

An overview of the many uses of biomedical information is available in my book,
Biomedical Informatics.

More information on cancer is available in my recently published book, Neoplasms.

© 2008 Jules Berman

As with all of my scripts, lists, web sites, and blog entries, the following disclaimer applies. This material is provided by its creator, Jules J. Berman, "as is", without warranty of any kind, expressed or implied, including but not limited to the warranties of merchantability, fitness for a particular purpose and noninfringement. In no event shall the author or copyright holder be liable for any claim, damages or other liability, whether in an action of contract, tort or otherwise, arising from, out of or in connection with the material or the use or other dealings in the material.


