The CDC data sets are available by anonymous ftp from the ftp.cdc.gov server. Most browsers come equipped for the ftp protocol, and you can enter the ftp address much as you would enter an http web address.
The address for the mortality data set is:
ftp.cdc.gov/pub/Health_Statistics/NCHS
/Datasets/DVS/mortality/
The image below shows how my Mozilla browser displays the mortality data set subdirectory when the full ftp address is entered in the location bar.
For medical data miners, this is one of the most important web sites in existence. When its files are unzipped, they provide an aggregate database of de-identified records, collected over several decades, of information culled from many millions of death certificates. This site alone can keep a medical informaticist busy and productive for his or her entire career. There is no limit to the utility of this site when its data is merged with data from other biomedical resources.
We will be using the 1999 data file, Mort1999us.zip (88,077,536 bytes).
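If you would rather script the download than fetch the file through a browser, here is a minimal Perl sketch that uses the Net::FTP module (distributed with Perl). The server, directory, and file name are the ones given above; the passive-mode setting and the anonymous e-mail password are assumptions that you may need to adjust for your own network.

#!/usr/local/bin/perl
use strict;
use Net::FTP;

# connect to the CDC ftp server (passive mode is an assumption;
# some networks need it, others do not)
my $ftp = Net::FTP->new("ftp.cdc.gov", Passive => 1)
    or die "Cannot connect to ftp.cdc.gov: $@";

# anonymous login; any e-mail-like string serves as the password
$ftp->login("anonymous", 'anonymous@example.com')
    or die "Anonymous login failed: ", $ftp->message;

# move to the mortality data set subdirectory
$ftp->cwd("/pub/Health_Statistics/NCHS/Datasets/DVS/mortality")
    or die "Cannot change directory: ", $ftp->message;

# the zip file is binary, so switch off ASCII mode before fetching
$ftp->binary;
$ftp->get("Mort1999us.zip")
    or die "Download failed: ", $ftp->message;

$ftp->quit;
exit;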
The other files we need from the CDC's ftp site are:
ftp.cdc.gov/pub/Health_Statistics/NCHS
/Publications/ICD10/each10.txt
ftp.cdc.gov/pub/Health_Statistics/NCHS
/Publications/ICD10CM/2007/i10idx0707.zip
ftp.cdc.gov/pub/Health_Statistics/NCHS
/Dataset_Documentation/mortality/Mort99doc.pdf
We also need an open source programming language, such as Perl, Ruby or Python. For this example, I'll be using very short Perl scripts. If you know Ruby or Python, you should have no trouble converting the scripts.
Two open source command line utilities that would be helpful, but not strictly necessary (I'll explain why in a later blog), are pdftk and xpdf.
I use these utilities for manipulating pdf files (extracting ranges of pages, or images, or converting pdf files to text files).
They are available for download from the following web sites.
http://www.accesspdf.com/pdftk/
http://www.foolabs.com/xpdf/download.html
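As a quick illustration of how these utilities might be used with the CDC files, here is a hedged Perl sketch. It assumes that pdftk and pdftotext (one of the utilities bundled with xpdf) are installed and on your command path; the page range is arbitrary, and the output file names are my own inventions.

#!/usr/local/bin/perl
use strict;

# extract the first ten pages of the mortality documentation file
# (the range 1-10 is only an illustration; choose the pages you need)
system("pdftk Mort99doc.pdf cat 1-10 output Mort99doc_excerpt.pdf") == 0
    or die "pdftk call failed: $?";

# convert the documentation file to plain text with pdftotext
system("pdftotext Mort99doc.pdf Mort99doc.txt") == 0
    or die "pdftotext call failed: $?";
exit;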
In the next blog, I will explain how to build a dictionary of ICD10 (International Classification of Diseases version 10) codes and terms (from the each10.txt file and the i10idx0707.zip file, vide supra). We will need this code dictionary to interpret the coded disease entries in the CDC Mortality files.
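To give a flavor of what that dictionary-building step will look like, here is a minimal Perl sketch. It is an illustration only: it assumes that each useful line of each10.txt begins with an ICD10 code followed by the term the code designates; the real file layout (and the handling of the i10idx0707.zip index) will be covered in the next post.

#!/usr/local/bin/perl
use strict;

# build a hash that maps ICD10 codes to their terms
# ASSUMPTION: a code (letter, digits, optional decimal part) starts
# each useful line, and the rest of the line is the term; adjust the
# pattern once the true format of each10.txt is known
my %dictionary;
open(my $icd, "<", "each10.txt") or die "Cannot open each10.txt: $!";
while (my $line = <$icd>) {
    chomp $line;
    if ($line =~ /^\s*([A-Z][0-9]+(?:\.[0-9]+)?)\s+(\S.*)$/) {
        $dictionary{$1} = $2;
    }
}
close($icd);

# look up a code pulled from a mortality record (hypothetical example)
my $code = "A00";
print "$code => $dictionary{$code}\n" if exists $dictionary{$code};
exit;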
As I remind readers in almost every blog post, if you want to do your own creative data mining, you will need to learn a little computer programming.
For Perl and Ruby programmers, methods and scripts for using a wide range of publicly available biomedical databases are described in detail in my prior books:
Perl Programming for Medicine and Biology
Ruby Programming for Medicine and Biology
An overview of the many uses of biomedical information is available in my book,
Biomedical Informatics.
More information on cancer is available in my recently published book, Neoplasms.
© 2008 Jules Berman
As with all of my scripts, lists, web sites, and blog entries, the following disclaimer applies. This material is provided by its creator, Jules J. Berman, "as is", without warranty of any kind, expressed or implied, including but not limited to the warranties of merchantability, fitness for a particular purpose and noninfringement. In no event shall the author or copyright holder be liable for any claim, damages or other liability, whether in an action of contract, tort or otherwise, arising from, out of or in connection with the material or the use or other dealings in the material.
In June 2014, my book, Rare Diseases and Orphan Drugs: Keys to Understanding and Treating the Common Diseases, was published by Elsevier. The book builds the argument that our best chance of curing the common diseases will come from studying and curing the rare diseases.
I urge you to read more about my book. There's a generous preview of the book at the Google Books site.
tags: biology of rare diseases, common diseases, genetic disease, disease genetics, orphan diseases, orphan drugs, rare disease organizations, rare disease research, rare diseases,