Thursday, December 18, 2008

CDC Mortality Data: 10

This is the tenth in a series of posts on the CDC's (Centers for Disease Control and Prevention) public use mortality data sets. In yesterday's blog in this series, we reviewed the steps taken to create our coccidioidomycosis mashup.

In today's blog, we'll look at how we can tweak our scripts to combine data from different data lists, to change occurrence data to incidence data, and to enhance the visual results.

Let's skip ahead in the discussion and look at the final mashup map, which contains incidence data for two different fungal diseases: coccidioidomycosis and histoplasmosis.

The red circles represent the coccidioidomycosis rates of infection, as recorded on death certificates. The blue circles represent histoplasmosis rates.

The map shows that coccidioidomycosis is endemic in the Southwest U.S., while histoplasmosis is endemic in the eastern half of the U.S. It took under a minute to generate this mashup map. In that time, the scripts parsed over one gigabyte of de-identified death certificate records, extracted the occurrences of both diseases along with the states in which they were recorded, and produced a visual output that conveys a detailed epidemiologic study at a glance.

To achieve this output, we simply tweaked the two scripts discussed in prior blogs in this series: the Perl script that collects cases based on diagnosis, and the Ruby script that draws circles on an outline map.

Conveniently, the ICD codes for conditions resulting from C. immitis infection (coccidioidomycosis) all begin with B38, and the codes for H. capsulatum infection (histoplasmosis) all begin with B39, so virtually the same code collects occurrences of either disease.

The key difference between today's graphic output and that shown in an earlier blog in this series is that today's map shows the rate of occurrence in each state, not just the total number of occurrences. By "rate", we mean the number of occurrences divided by the at-risk population, which here is the total number of death certificate records from that state. As shown in the code below, as the script parses the ICD data records, it tallies the number of records from each state in the %state_total hash and the number of disease occurrences in each state in the %state_disease hash.

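# Excerpt: assumes the ICD filehandle is already open on the mortality file and
# that %statehash (CDC state code => state abbreviation) was filled earlier.
while (my $line = <ICD>)
  {
  my $code = substr($line,20,2);             # two-character state code
  my $state = $statehash{$code};
  next unless defined $state;                # skip records with an unknown state code
  $state_total{$state}++;                    # every record from this state
  my $codesection = substr($line,161,140);   # ICD condition codes on the record
  if ($codesection =~ /B39/)                 # B39 = histoplasmosis; use B38 for coccidioidomycosis
    {
    $state_disease{$state}++;                # records from this state that list the disease
    }
  }
open (DATAFILE, ">state_count.txt") or die "Cannot open state_count.txt: $!";
while ((my $key, my $value) = each(%state_disease))
  {
  my $rate = int(($value / $state_total{$key})*50000);
  print "$key $value $rate $state_total{$key}\n";
  print DATAFILE "$key $rate\n";
  }
close DATAFILE;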

The resulting rate is determined in the following line:

$rate = int(($value / $state_total{$key})*50000);

Note that the ratio is multiplied by 50000. This factor was chosen only because it yields graphically pleasing circle sizes; it has no scientific significance. The circles are proportional symbols, not measurements.
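For readers who prefer Ruby, the other language used in this series, here is a minimal sketch of the same tally-and-scale idea, collecting both diseases in a single pass. It is not the script that produced the map. The field offsets are copied from the Perl excerpt, but the STATES table, the fake_record helper, and the sample records are all made up for illustration.

```ruby
# Sketch only. Offsets follow the Perl excerpt (state code at 20, width 2;
# condition codes at 161, width 140). STATES and the records are invented.
STATES   = { "03" => "AZ", "36" => "OH" }.freeze
DISEASES = { "B38" => :coccidioidomycosis, "B39" => :histoplasmosis }.freeze
SCALE    = 50_000 # arbitrary display factor, as in the post

# Hypothetical helper: build a fake fixed-width record for the demonstration.
def fake_record(state_code, icd_codes)
  line = " " * 301
  line[20, 2] = state_code
  codes = icd_codes.join(" ")
  line[161, codes.length] = codes
  line
end

# One pass over the records: count all records per state, and count the
# records per state that mention each disease prefix.
def tally(records)
  state_total   = Hash.new(0)
  state_disease = Hash.new { |hash, name| hash[name] = Hash.new(0) }
  records.each do |line|
    state = STATES[line[20, 2]]
    next if state.nil? # unknown state code
    state_total[state] += 1
    section = line[161, 140].to_s
    DISEASES.each do |prefix, name|
      state_disease[name][state] += 1 if section.include?(prefix)
    end
  end
  [state_total, state_disease]
end

# Occurrences divided by all records from the state, times the display factor.
def scaled_rates(state_total, counts)
  counts.to_h { |state, n| [state, (n.to_f / state_total[state] * SCALE).to_i] }
end

records = [
  fake_record("03", %w[B389 J189]),
  fake_record("03", %w[I219]),
  fake_record("36", %w[B399]),
  fake_record("36", %w[I251]),
]
state_total, state_disease = tally(records)
DISEASES.each_value do |name|
  puts "#{name}: #{scaled_rates(state_total, state_disease[name])}"
end
```

Because every state is multiplied by the same factor, changing SCALE changes the circle sizes but not the ranking of the states.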

What is the difference between a disease rate and a disease incidence? Incidence is a formal epidemiologic concept: the number of new cases arising in a given period, usually expressed per 100,000 members of a living population. I believe that the way we have treated the data (death certificate occurrences of disease as a proportion of all death certificate records) is a valid way to represent the data, but if you want to work with living populations, you will need to use state population data. The CDC data dictionary for the mortality files provides this information.
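As a rough Ruby sketch of the living-population approach, the denominator simply changes from the state's death records to the state's population, and the multiplier becomes 100,000. The counts and populations below are invented placeholders, not values from the CDC files, and rate_per_100k is a hypothetical helper name.

```ruby
# Sketch only: all figures are made up. Real counts would come from the
# mortality file and real populations from state population tables.
def rate_per_100k(cases, population)
  raise ArgumentError, "population must be positive" unless population.positive?
  (cases.to_f / population * 100_000).round(2)
end

deaths_with_disease = { "AZ" => 30, "OH" => 12 }                 # invented counts
state_population    = { "AZ" => 5_000_000, "OH" => 11_000_000 }  # invented populations

deaths_with_disease.each do |state, cases|
  rate = rate_per_100k(cases, state_population.fetch(state))
  puts "#{state}: #{rate} per 100,000"
end
```

Keep in mind that the numerator is still a count of death certificates that mention the disease, so the result is closer to a mortality rate than to a true incidence of new cases.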

If you want to go one step further, and obtain age-adjusted incidence rates, you'll need to extract patient ages for the occurrences of disease (a simple process, as age is provided in the mortality data sets) and use the stratified population age data tables, also provided in the CDC data dictionary file.
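One common way to age-adjust is direct standardization: compute a rate within each age stratum, then weight each stratum's rate by that stratum's share of a standard population. The Ruby sketch below assumes that method. The age groups, counts, and populations are invented, and age_adjusted_rate is a hypothetical helper name.

```ruby
# Sketch of direct age standardization. All numbers are invented.
# strata:              age group => { cases:, population: } for one state
# standard_population: age group => size of that group in the standard population
def age_adjusted_rate(strata, standard_population)
  total_standard = standard_population.values.sum.to_f
  weighted = strata.sum do |age_group, stratum|
    stratum_rate = stratum[:cases].to_f / stratum[:population]
    weight = standard_population.fetch(age_group) / total_standard
    stratum_rate * weight
  end
  weighted * 100_000 # express per 100,000
end

strata = {
  "0-44"  => { cases: 2,  population: 1_000_000 },
  "45-64" => { cases: 6,  population: 500_000 },
  "65+"   => { cases: 12, population: 300_000 },
}
standard_population = { "0-44" => 600_000, "45-64" => 250_000, "65+" => 150_000 }

crude = strata.values.sum { |s| s[:cases] }.to_f /
        strata.values.sum { |s| s[:population] } * 100_000
puts "crude rate:        #{crude.round(2)} per 100,000"
puts "age-adjusted rate: #{age_adjusted_rate(strata, standard_population).round(2)} per 100,000"
```

With these made-up numbers, the crude rate is about 1.11 per 100,000 and the age-adjusted rate is about 1.02. The two differ because the invented state has a larger share of older residents than the standard population does.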

The data dictionary file is available by anonymous ftp at the following subdirectory:


As I remind readers in almost every blog post, if you want to do your own creative data mining, you will need to learn a little about computer programming.

For Perl, Ruby, or Python programmers, methods and scripts for using a wide range of publicly available biomedical databases are described in detail in my prior books:

Perl Programming for Medicine and Biology

Ruby Programming for Medicine and Biology

Methods in Medical Informatics: Fundamentals of Healthcare Programming in Perl, Python, and Ruby

An overview of the many uses of biomedical information is available in my book, Biomedical Informatics.

More information on cancer is available in my recently published book, Neoplasms: Principles of Development and Diversity.

© 2008 Jules Berman

As with all of my scripts, lists, web sites, and blog entries, the following disclaimer applies. This material is provided by its creator, Jules J. Berman, "as is", without warranty of any kind, expressed or implied, including but not limited to the warranties of merchantability, fitness for a particular purpose and noninfringement. In no event shall the author or copyright holder be liable for any claim, damages or other liability, whether in an action of contract, tort or otherwise, arising from, out of or in connection with the material or the use or other dealings.

