Sunday, December 14, 2008

CDC Mortality Data: 5

This is the fifth in a series of posts on the CDC's (Centers for Disease Control and Prevention) public use mortality data sets.

In the fourth blog of this series, we learned how the cause of death data from death certificate records is transformed into a mortality record consisting of an alphanumeric sequence. The causes of death are represented by ICD codes. If we have a computer-parsable list of ICD codes, we can write a short program that assigns human-readable terms (full names of diseases) to the codes in the mortality files.

Let's start with the each10.txt file, available by anonymous ftp from the ftp.cdc.gov server at:

/pub/Health_Statistics/NCHS/Publications/ICD10/each10.txt

Here are the first few lines of this file:

A00Cholera
A00.0Cholera due to Vibrio cholerae 01, biovar cholerae
A00.1Cholera due to Vibrio cholerae 01, biovar el tor
A00.9Cholera, unspecified
A01Typhoid and paratyphoid fevers
A01.0Typhoid fever
A01.1Paratyphoid fever A
A01.2Paratyphoid fever B
A01.3Paratyphoid fever C
A01.4Paratyphoid fever, unspecified
A02Other salmonella infections
A02.0Salmonella gastroenteritis
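
Notice that each line fuses the code and the term, with no delimiter between them. Here is a minimal sketch of how one such line could be split. The split_icd_line subroutine is my own illustrative name, and the pattern assumes a code is one letter, two digits, and an optional ".digit" suffix; the full script later in this post uses a looser pattern.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Split one each10.txt-style line into (code, term).
# Assumption: a code is a letter, two digits, and an optional ".digit" suffix,
# and the term follows immediately, with no delimiter.
sub split_icd_line {
    my ($line) = @_;
    if ($line =~ /^\s*([A-Z][0-9]{2}(?:\.[0-9])?)\s*(.+?)\s*$/) {
        return ($1, $2);
    }
    return;    # empty list when the line does not start with a code
}

# Three lines copied from the excerpt above.
my @sample = (
    "A00Cholera",
    "A00.0Cholera due to Vibrio cholerae 01, biovar cholerae",
    "A01.1Paratyphoid fever A",
);

my %dictionary;
for my $line (@sample) {
    my ($code, $term) = split_icd_line($line);
    $dictionary{$code} = $term if defined $code;
}

print "$_ => $dictionary{$_}\n" for sort keys %dictionary;
```

A term that begins with a digit would be ambiguous under this pattern, so treat it as a starting point rather than a complete parser.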

There are 9,320 terms in the each10.txt file, sufficient for many purposes. However, the entries in each10.txt were selected from a much larger collection of ICD10 terms. The terms in the each10.txt file are "cause of death" concepts. They will match causes of death found on Death Certificates. However, Death Certificates (and hence the public use CDC mortality data sets) include "other significant conditions" in addition to causes of death (discussed in yesterday's blog). If we want to find the meaning of all of the conditions contained in the CDC mortality files, we need to supplement the each10.txt file with additional ICD10 entries.

More ICD10 entries are found in the i10idx0707.pdf file, a large index of ICD10 terms and their codes. This file is also available by anonymous ftp from the CDC server at:

ftp.cdc.gov
/pub/Health_Statistics/NCHS/Publications/ICD10CM/2007/i10idx0707.zip

Download and unzip this file (a freeware unzip utility is available at http://www.7zip.com/).

The unzipped file is an Adobe Acrobat pdf file 1,344 pages in length. An excerpt from the first page of the .pdf file is shown:



We need to convert this .pdf file into a .txt file if we expect to parse the file and extract codes and matching terms. For most .pdf files, you can simply select, copy, and paste pages into a .txt file. This method is tricky for very large files, because it requires a lot of memory. Sometimes the textual output from .pdf files is garbled or contains errors. Some .pdf files do not support copy and paste operations at all.

I like to convert .pdf files to .txt files in a two-step operation using free, open source command-line pdf utilities: pdftk and xpdf.

pdftk is available at: http://www.accesspdf.com/pdftk/
The zipped .exe file is pdftk-1.12.exe.zip

xpdf is available at: http://www.foolabs.com/xpdf/download.html
The zipped .exe file is xpdf-3.02pl2-win32.zip

Many .pdf files come in an internally compressed format. The compression is not apparent to the user. There is no special file extension, and the Adobe Reader software decompresses the file seamlessly. Before converting compressed .pdf files to text, we need to decompress the file.

Copy the i10idx0707.pdf file to the pdftk subdirectory. Then, from the command line in the pdftk subdirectory, uncompress the compressed .pdf file with the following command.

C:\pdftk>pdftk i10idx0707.pdf output mydoc_un.pdf uncompress

Now we have an uncompressed .pdf file, mydoc_un.pdf

From the xpdf subdirectory, we can convert the uncompressed pdf file to a .txt file. Remember to copy the mydoc_un.pdf file to the xpdf subdirectory first.

C:\xpdf>pdftotext mydoc_un.pdf mydoc.txt

This produces a text (ASCII) version of the icd10 index file:

MYDOC.TXT (2,360,138 bytes)
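
If you would rather drive both steps from a single Perl script, something like the following sketch might work. It assumes that pdftk and pdftotext are both on your PATH and that all files sit in the current directory, which differs from the separate-subdirectory setup described above. The conversion_commands subroutine and the --run switch are my own illustrative names.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Build the two command lines shown above as argument lists.
# Assumption: pdftk and pdftotext can be found on the PATH.
sub conversion_commands {
    my ($pdf_in, $pdf_tmp, $txt_out) = @_;
    return (
        [ 'pdftk', $pdf_in, 'output', $pdf_tmp, 'uncompress' ],
        [ 'pdftotext', $pdf_tmp, $txt_out ],
    );
}

my @steps = conversion_commands('i10idx0707.pdf', 'mydoc_un.pdf', 'mydoc.txt');

# Execute the steps only when "--run" is given; otherwise just print them.
my $run = grep { $_ eq '--run' } @ARGV;
for my $cmd (@steps) {
    print join(' ', @$cmd), "\n";
    next unless $run;
    system(@$cmd) == 0 or die "Step failed: @$cmd (exit status $?)\n";
}
```

Run without arguments, the script only prints the two commands, which is a safe way to check them before committing to a conversion of a 1,344 page file.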

At last, we have two .txt files (mydoc.txt and each10.txt) that we can use, together, to create a clean, computer-parsable list of ICD codes and their equivalent terms, in English.

Here is the Perl script that does the job. You will need to place the mydoc.txt file and the each10.txt file in the same subdirectory as the Perl script.

#!/usr/bin/perl
open(TEXT, "mydoc.txt");
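undef($/); # slurp mode: read the whole file as one string
$var = <TEXT>;
close TEXT;
$var =~ tr/\14//d; # delete form feeds (page breaks)
# put a line break after every ICD code, so that each line ends with one code
$var =~ s/ ([A-Z][0-9][0-9\.]*[0-9])/ $1\n/g;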
open(TEXT, ">mydoc.out");
print TEXT $var;
close TEXT;
$/ = "\n";
open(TEXT, "mydoc.out");
$line = " ";
while ($line ne "")
{
$line = <TEXT>;
if ($line =~ /\b([A-Z][0-9][0-9\.]*[0-9]) *$/)
{
$term = $`;
$code = $1;
$term =~ s/^[ \-]*//o;
$term =~ s/[ \-]*$//o;
$term = lc($term);
$line =~ tr/\173//d;
$dictionary{$code} = $term;
}
}
close TEXT;
open (ICD, "each10.txt")||die"cannot";
undef($/);
$line = <ICD>;
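# strip control characters (keeping line feeds and carriage returns), the
# punctuation from "!" through "-", and every character from "{" upward
$line =~ tr/\000-\011//d;
$line =~ tr/\013-\014//d;
$line =~ tr/\016-\037//d;
$line =~ tr/\041-\055//d;
$line =~ tr/\173-\377//d;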
@linearray = split(/\n(?=[ ]*[A-Z][0-9\.]{1,5})/, $line);
foreach $thing (@linearray)
{
if ($thing =~ /^ *([A-Z][0-9\.]{1,5}) ?/)
{
$code = $1;
$term = $';
$term =~ s/\n//;
$term =~ s/[ ]+$//;
$dictionary{$code} = $term;
}
}
unlink("mydoc.out");
open (TEXT, ">mydoc.out");
foreach $key (sort keys %dictionary)
{
print TEXT "$key $dictionary{$key}\n";
}
close TEXT;
exit;

The output file is: mydoc.out (1,091,342 bytes). It contains about 23,000 code/term pairs.
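
The heart of the first half of the script is the pair of regular expressions that break the index text after each code and then peel the code off the end of each line. Here is a small sketch of those two steps in isolation. The sample string is invented to mimic entries that run together; it is not copied from the index file.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Invented sample mimicking index text in which several entries run together.
my $text = "- acute R10.0 - angina K55.1 Abdominalgia R10.4";

# Step 1: break the text after every code, as in the script above.
$text =~ s/ ([A-Z][0-9][0-9\.]*[0-9])/ $1\n/g;

# Step 2: on each line, the trailing code is the key and the text before it
# is the term.
my %dictionary;
for my $line (split /\n/, $text) {
    next unless $line =~ /^(.*?)\b([A-Z][0-9][0-9\.]*[0-9]) *$/;
    my ($term, $code) = ($1, $2);
    $term =~ s/^[ \-]+//;    # strip leading spaces and index dashes
    $term =~ s/[ \-]+$//;    # strip trailing spaces and dashes
    $dictionary{$code} = lc $term;
}

print "$_ $dictionary{$_}\n" for sort keys %dictionary;
```

This version captures the term with parentheses instead of using the $` variable, which is a stylistic choice on my part; the matching logic is meant to be the same.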

I renamed the mydoc.out file to icd10_pl.txt. We will use this file in the next Perl script, to determine the total number of occurrences of each condition in the 1+ Gbyte Mort99us.dat file. You will need to place the icd10_pl.txt file and the Mort99us.dat file in the same subdirectory as this Perl script.

#!/usr/local/bin/perl
open (ICD, "icd10_pl.txt");
$line = " ";
while ($line ne "")
{
$line = <ICD>;
if ($line =~ /^([A-Z][0-9\.]+) +/)
{
$code = $1;
$term = $';
$code =~ s/\.//o;
$term =~ s/\n//;
$term =~ s/ +$//;
$dictionary{$code} = $term;
}
}
close ICD;
open (ICD, "Mort99us.dat");
$line = " ";
print "\n\n";
while ($line ne "")
{
$line = <ICD>;
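# the condition codes sit in a 140-character field starting at offset 161
$codesection = substr($line,161,140);
@codearray = split(/ +/,$codesection);
foreach $code (@codearray)
{
# skip empty or non-code fragments, so that a stale match is never counted
next unless $code =~ /[A-Z][0-9]+/;
$code = $&;
$counter{$code}++;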
}
}
close ICD;
open (OUT, ">cdc.out");
while ((my $key, my $value) = each(%counter))
{
# zero-pad the count to six digits, so that a plain string sort ranks by count
$value = "000000" . $value;
$value = substr($value,-6,6);
push(@filearray, "$value $key $dictionary{$key}");
}
$outfile = join("\n", reverse(sort(@filearray)));
print OUT $outfile;
exit

On my 2.5 GHz CPU desktop computer, it takes well under a minute to parse through the 1+ Gbyte CDC mortality data set and produce the desired output file (cdc.out). The script parsed a total of 2,394,871 records. There are 5,650 conditions included in the 1999 CDC mortality data set.
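
To see the counting step on its own, here is a sketch that runs against three made-up records instead of the real file. The fake_record subroutine is purely illustrative: it builds a string with 161 filler characters followed by a 140-character code section, which is all this sketch assumes about the record layout.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Build a made-up record: 161 filler characters, then a 140-character
# code section holding space-separated codes.
sub fake_record {
    my (@codes) = @_;
    return ('x' x 161) . sprintf('%-140s', join(' ', @codes));
}

my @records = (
    fake_record(qw(I251 I10 J449)),
    fake_record(qw(I251 E149)),
    fake_record(qw(I10 I251)),
);

my %counter;
for my $record (@records) {
    my $codesection = substr($record, 161, 140);
    # Count every code-shaped token; anything else is ignored.
    $counter{$_}++ for $codesection =~ /\b([A-Z][0-9]+)\b/g;
}

# Sort by count (descending), then by code, and zero-pad only for display.
my @ranked = sort { $counter{$b} <=> $counter{$a} or $a cmp $b } keys %counter;
printf "%06d %s\n", $counter{$_}, $_ for @ranked;
```

Sorting numerically on the counts, as done here, is an alternative to the zero-padded string sort in the script above. The string sort works as long as no count needs more than six digits.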

The first 45 lines of the output file are:

412827 I251 Atherosclerotic heart disease
352559 I469 Cardiac arrest unspecified
273644 I500 Congestive heart failure
244162 I219 Acute myocardial infarction unspecified
210394 J449 Chronic obstructive pulmonary disease unspecified
206996 J189 Pneumonia unspecified
203906 I10 Essential primary hypertension
176834 I64 Stroke not specified as hemorrhage or infarction
162128 C349 Bronchus or lung unspecified
149777 E149 Without complications
143326 J969 Respiratory failure unspecified
129947 A419 Septicemia unspecified
115504 F03 Unspecified dementia
106137 N19 Unspecified renal failure
101224 I250 Atherosclerotic cardiovascular disease so described
075834 G309 Alzheimers disease unspecified
073365 I709 Generalized and unspecified atherosclerosis
072931 R092 Respiratory arrest
067525 I499 Cardiac arrhythmia unspecified
067055 I48 Atrial fibrillation and flutter
065657 C80 Malignant neoplasm without specification of site
056718 J690 Pneumonitis due to food and vomit
056425 C189 Colon unspecified
051698 C509 Breast unspecified
045707 C61 Malignant neoplasm of prostate
043191 I429 Cardiomyopathy unspecified
038127 N189 Chronic renal failure unspecified
038072 E119 Without complications
037810 I119 Hypertensive heart disease without congestive heart failure
036801 I739 Peripheral vascular disease unspecified
036151 D649 Anemia unspecified
035720 N390 Urinary tract infection site not specified
035552 K922 Gastrointestinal hemorrhage unspecified
035133 J439 Emphysema unspecified
031981 K746 Other and unspecified cirrhosis of liver
031327 E86 Volume depletion
031266 G20 Parkinsons disease
031050 N180 Endstage renal disease
030693 N179 Acute renal failure unspecified
030442 R99 Other illdefined and unspecified causes of mortality
030416 C259 Pancreas unspecified
030248 K729 Hepatic failure unspecified
028458 I255 Ischemic cardiomyopathy
026860 I269 Pulmonary embolism without mention of acute cor pulmonale
026445 I509 Heart failure unspecified

The top line is:

412827 I251 Atherosclerotic heart disease

It indicates that atherosclerotic heart disease is the most common condition listed in the death certificates in 1999 in the U.S. It was listed 412,827 times. The ICD10 code for Atherosclerotic heart disease is I25.1.
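
The mortality file stores codes without the period (I251), while the ICD listings write them with one (I25.1). A small helper can convert between the two. This is a sketch, not part of the scripts above; dotted_code is my own name for it, and it assumes the period always belongs after the third character.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Re-insert the period after the third character of an ICD10 code.
# Three-character codes such as I10 are returned unchanged.
sub dotted_code {
    my ($code) = @_;
    $code =~ s/^([A-Z][0-9]{2})([0-9A-Z]+)$/$1.$2/;
    return $code;
}

print dotted_code($_), "\n" for qw(I251 I10 C189);
```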

Some of the output lines do not seem particularly helpful. For example:

056425 C189 Colon unspecified
051698 C509 Breast unspecified

Nobody dies from "Colon unspecified." The strange diagnosis is explained by the rather unsatisfactory way that the ICD assigns terms to codes. In this case, "Colon unspecified" is a sub-term in the general category of "Neoplasms of the colon." We know this because all of the codes beginning with "C" (here, C189 and C509) are cancer codes. Whenever an ICD term appears uninformative, we can return to the icd10_pl.txt file (created earlier in this post) and clarify its meaning by examining the root term for the sub-term.
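
Looking up the root term can be automated. The sketch below prefixes a sub-term with the term of its three-character root code. The four dictionary entries are typed in by hand for illustration, and the wording of the root terms in your own icd10_pl.txt file may differ; full_term is my own name for the subroutine.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hand-typed illustrative entries; in practice, load these from icd10_pl.txt.
my %dictionary = (
    'C18'   => 'Malignant neoplasm of colon',
    'C18.9' => 'Colon unspecified',
    'C50'   => 'Malignant neoplasm of breast',
    'C50.9' => 'Breast unspecified',
);

# Return "root term: sub-term" when the code has a known root,
# or the plain term when it does not.
sub full_term {
    my ($dict, $code) = @_;
    my $term = $dict->{$code};
    return unless defined $term;
    (my $root = $code) =~ s/\..*$//;    # C18.9 -> C18
    return $term if $root eq $code || !defined $dict->{$root};
    return "$dict->{$root}: $term";
}

print "$_ ", full_term(\%dictionary, $_), "\n" for ('C18.9', 'C50.9', 'C18');
```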

In the next blog in this series, we will try a more ambitious project, using the CDC mortality data and the U.S. map in a Ruby mashup script.

As I remind readers in almost every blog post, if you want to do your own creative data mining, you will need to learn a little computer programming.

For Perl and Ruby programmers, methods and scripts for using a wide range of publicly available biomedical databases are described in detail in my prior books:

Perl Programming for Medicine and Biology

Ruby Programming for Medicine and Biology

An overview of the many uses of biomedical information is available in my book,
Biomedical Informatics.

More information on cancer is available in my recently published book, Neoplasms.

© 2008 Jules Berman

As with all of my scripts, lists, web sites, and blog entries, the following disclaimer applies. This material is provided by its creator, Jules J. Berman, "as is", without warranty of any kind, expressed or implied, including but not limited to the warranties of merchantability, fitness for a particular purpose and noninfringement. In no event shall the author or copyright holder be liable for any claim, damages or other liability, whether in an action of contract, tort or otherwise, arising from, out of or in connection with the material or the use or other dealings in the material.



tags: big data, metadata, data preparation, data analytics, data repurposing, datamining, data mining, neoplasms, nomenclature, Perl script