Saturday, December 13, 2008

CDC Mortality Data: 4

This is the fourth in the series of posts on the CDC's (Centers for Disease Control and Prevention) public use mortality data sets.

In the third blog of this series, we learned how the underlying causes of death are listed on a prototypical death certificate.

In this blog, we'll discuss how the data on every death certificate is transformed into a mortality record consisting of an alphanumeric sequence.

We will use the large (1+ gigabyte) CDC public mortality file, and its data dictionary. Access to these files was described in an earlier blog from this series.

The files are:

Mort99us.dat (1,058,532,982 bytes)

and

Mort99do.pdf (4,911,017 bytes)

We could just as easily have used mortality data from other years; they're also available from the CDC ftp site.

The Mort99us.dat file consists of millions of records, one record per line, each line consisting of a long string of alphanumerics.

The portion of each record that we are most interested in is the stretch of alphanumerics extending from byte 162 to byte 301.

The data dictionary file, on page 36, explains the significance of this stretch of characters (see figure below).



Each 7-character piece of this stretch represents one diagnosis. There can be as many as 20 7-character fragments in the 140 bytes from position 162 to 301.
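Here is a minimal Python sketch of how that stretch might be cut into fragments. It assumes that each record is one line of text and that the byte positions in the data dictionary are 1-based, so bytes 162 to 301 correspond to the Python slice [161:301]. The function name diagnosis_fragments and the sample record are made up for illustration; they are not part of the CDC files.

```python
def diagnosis_fragments(record):
    """Return the non-blank 7-character fragments from bytes 162-301."""
    # Byte positions in the data dictionary are assumed to be 1-based,
    # so bytes 162..301 map to the 0-based slice [161:301].
    stretch = record[161:301].ljust(140)
    fragments = [stretch[i:i + 7] for i in range(0, 140, 7)]
    # Unused slots are blank; drop them and trim trailing padding.
    return [f.rstrip() for f in fragments if f.strip()]


# A made-up record: 161 filler characters, then two fragments.
sample = "x" * 161 + "11I219 " + "21I251 "
print(diagnosis_fragments(sample))
```

For a file of this size, it is probably wise to read one line at a time rather than loading the whole file into memory.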

Each 7-character subsequence consists of:

First character. Line indicator: The first byte represents the line of the death certificate on which the code appears (discussed in a prior blog in this series). Six lines (1-6) are allowable, with the fourth and fifth denoting that an additional condition was written in beyond the four lines provided in Part I of the U.S. Standard Certificate of Death. Line "6" represents Part II of the death certificate.

Second character. Position indicator: The next byte indicates the position of the code on the line, i.e., it is the first (1), second (2), third (3) .... eighth (8) code on the line.

Third through sixth characters. These four bytes represent the ICD-10 (International Classification of Diseases, version 10) code.

Seventh character. The seventh and last byte is blank.
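The layout just described can be sketched in a few lines of Python. This is only an illustration under the assumptions stated above (one digit for the line, one digit for the position, up to four characters of code, then a blank); the names Entry and decode_fragment are mine, not the CDC's, and the two sample fragments are illustrative.

```python
from typing import NamedTuple


class Entry(NamedTuple):
    line: int       # line of the death certificate (1-6)
    position: int   # position of the code on that line
    code: str       # ICD-10 code without the dot, e.g. "I219"


def decode_fragment(fragment):
    """Split one 7-character fragment into line, position and ICD-10 code."""
    return Entry(
        line=int(fragment[0]),
        position=int(fragment[1]),
        # Codes shorter than four characters are padded with blanks.
        code=fragment[2:6].strip(),
    )


print(decode_fragment("11I219 "))
print(decode_fragment("62R54  "))
```

A named tuple keeps the three parts labeled, which makes later filtering by line number easier to read.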

This protocol captures all of the information conveyed in the cause of death section of the death certificate, including the line number and the number of causes on each line. Within Part I, the highest-numbered line (5 is the highest permissible number) holds the underlying cause of death that leads, ultimately, to the proximate cause of death.

An example of a cause of death record sequence is:

11I219 21I251 61I500 62R54

In this example, there are two causes of death:

11I219 (first line, first condition on line, ICD diagnosis I219)

and

21I251 (second line, first condition on line, ICD diagnosis I251)

In addition, there are two medical conditions that the doctor listed as "other significant conditions", separately from the underlying causes of death. These are always designated with a "6":

61I500 ("other significant condition" list, item one, ICD code I500)

62R54 ("other significant condition" list, item two, ICD code R54)

The file does not tell us the term-equivalent of the listed codes.

For this, we need to use an ICD10 dictionary.

I219 = (I21.9, Acute myocardial infarction unspecified)
I251 = (I25.1 Atherosclerotic heart disease)
I500 = (I50.0 Congestive heart failure)
R54 = (R54 Senility)

Notice that the actual ICD10 codes contain a dot, and the codes in the CDC mortality sequence do not.
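Putting the dot back is a one-line job. The sketch below assumes that the dot always falls after the third character, as it does in the four codes listed above; the function name with_dot is just a label I chose.

```python
def with_dot(code):
    """Insert the dot after the third character of a dotless ICD-10 code."""
    # Three-character codes such as "R54" have nothing after the dot.
    return code if len(code) <= 3 else code[:3] + "." + code[3:]


for code in ["I219", "I251", "I500", "R54"]:
    print(code, "->", with_dot(code))
```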

So, now we see the full picture of the cause of death section of the death certificate. Atherosclerotic heart disease was considered the underlying cause of death. Acute myocardial infarction was considered the proximate cause of death. Congestive heart failure and senility were considered "other significant conditions."
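The same reading can be sketched in Python. The terms are copied from the ICD10 lookups shown earlier in this post; the function name summarize is hypothetical. The sketch splits the compact example sequence on blanks, whereas a real record would be cut into fixed-width pieces. It also assumes at least one Part I entry, and when the highest line holds several codes it simply takes the last one, which is a simplification.

```python
# Terms copied from the ICD10 lookups shown earlier in this post.
TERMS = {
    "I219": "Acute myocardial infarction unspecified",
    "I251": "Atherosclerotic heart disease",
    "I500": "Congestive heart failure",
    "R54": "Senility",
}


def summarize(sequence):
    """Sort the entries of a cause-of-death sequence into their roles."""
    # Each entry is a line digit, a position digit, then the ICD code.
    entries = [(int(e[0]), int(e[1]), e[2:]) for e in sequence.split()]
    part_one = sorted(e for e in entries if e[0] <= 5)
    other = sorted(e for e in entries if e[0] == 6)
    return {
        # Lowest line in Part I: the proximate cause.
        "proximate": TERMS[part_one[0][2]],
        # Highest line in Part I: the underlying cause.
        "underlying": TERMS[part_one[-1][2]],
        # Line 6 is Part II: other significant conditions.
        "other": [TERMS[code] for _, _, code in other],
    }


print(summarize("11I219 21I251 61I500 62R54"))
```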

What file did we use to find the meaning of the ICD codes? As described in an earlier post, the CDC site contains several files containing ICD codes. I used the files each10.txt and i10idx0707.pdf.

If you have the patience, you can open and view these files and look up ICD10 codes, one at a time, until you've found all of the cause of death terms for all of the millions of cases included in the CDC mortality files. Then you can start to analyze your results.

I don't recommend this method. Parsing through files and attaching terms to codes is something that computers should do for you. Unfortunately, there are two problems that will give us a little extra work, before we can write a script that parses the CDC mortality files.

The first problem is the computer-unfriendly format of the native ICD10 files. The people who created ICD10 did so with the rather limited view that ICD10 coding is an inherently human task. They used human-friendly organization techniques (such as organization through indentation, and the use of partial term names in cases where the term is subsumed, in part, by a preceding term). This means that the ICD10 definition files cannot be immediately parsed by a computer without some preliminary modification.

The second problem is the proprietary nature of the ICD10 files. ICD10 is copyrighted to the World Health Organization. This means that individuals cannot post copies of ICD10 dictionaries to the web. Nonetheless, you can find ICD10 code files posted to the web, in apparent violation of copyright. As far as I can tell, all of the readily downloadable versions are incomplete.

So, if we want to do some work with the CDC public use mortality files, we will need to write a script that extracts information from existing publicly available ICD10 files to produce our own computer-parsable file, one that is larger than any single available file and that contains most or all of the codes found in bytes 162 to 301 of the mortality data files.

In the next blog in this series, I will take you through all the steps to produce an ICD10 dictionary file that anyone can use to parse and translate data from the public use CDC mortality files.

As I remind readers in almost every blog post, if you want to do your own creative data mining, you will need to learn a little computer programming.

For Perl and Ruby programmers, methods and scripts for using a wide range of publicly available biomedical databases are described in detail in my prior books:

Perl Programming for Medicine and Biology

Ruby Programming for Medicine and Biology

An overview of the many uses of biomedical information is available in my book, Biomedical Informatics.

More information on cancer is available in my recently published book, Neoplasms.

© 2008 Jules Berman

As with all of my scripts, lists, web sites, and blog entries, the following disclaimer applies. This material is provided by its creator, Jules J. Berman, "as is", without warranty of any kind, expressed or implied, including but not limited to the warranties of merchantability, fitness for a particular purpose and noninfringement. In no event shall the author or copyright holder be liable for any claim, damages or other liability, whether in an action of contract, tort or otherwise, arising from, out of or in connection with the material or the use or other dealings in the material.

