Monday, December 29, 2008

Perl, Ruby and Python scripts for CDC public use mortality file parsing

I have wanted to know whether sickle cell incidence is decreasing in the U.S. population. Despite PubMed and web searches, I have not been able to find a single data source on the subject. In a prior post, I referred to one of my Perl scripts that parsed through CDC public use records (about 5 gigabytes of raw data). The results seemed to suggest that the incidence of sickle cell anemia in the U.S. may be increasing.

For those interested, here are Perl, Ruby and Python versions of the script. Each version parses the CDC public use mortality files for the years 1996, 1999, 2002, and 2004 and reports, for each year, the number of death certificate records that list sickle cell disease and the rate per 100,000 records.
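All three versions perform the same small operation: slice a fixed-width code section out of each record, test it for the string D57, and divide the matches by the total number of records. Here is a minimal Python sketch of that step, run on made-up records rather than the CDC files. The function name `count_code` and the sample records are hypothetical, chosen only for illustration.

```python
import re

def count_code(records, offset, width=140, code="D57"):
    """Count records whose fixed-width code section contains `code`.

    Returns (matches, total, rate per 100,000 records).
    """
    pattern = re.compile(code, re.IGNORECASE)
    matches = 0
    total = 0
    for record in records:
        total += 1
        # Slice the code section; a short record yields an empty string.
        field = record[offset:offset + width]
        if pattern.search(field):
            matches += 1
    rate = 100000.0 * matches / total if total else 0.0
    return matches, total, rate

# Made-up records: a 10-character prefix followed by a code section.
sample = [
    "0123456789I219 J449",
    "0123456789D571 A419",
    "0123456789C349",
    "D57xxxxxxxI500",  # D57 sits before the offset, so it is not counted
]
matches, total, rate = count_code(sample, offset=10)
print(matches, total, rate)
```

The last sample record shows why the offset matters: a match outside the code section is ignored.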

Please refer to the prior post for discussion of the data.

Please refer to my document describing uses for the CDC public use mortality files for general instructions on acquiring and analyzing the de-identified U.S. death certificate data.

#!/usr/local/bin/perl
# Count records that list sickle cell disease (code D57) in each CDC
# public use mortality file; report the rate per 100,000 records.
use strict;
use warnings;
my @filearray = qw(mort96us.dat mort99us.dat mort02us.dat mort04us.dat);
# Zero-based start of the 140-character code section, one per file.
my @offsets = (448, 161, 162, 164);
foreach my $i (0 .. $#filearray)
{
    my $file = $filearray[$i];
    open(my $icd, "<", $file) or die "Cannot open $file: $!";
    my $popcount = 0;
    my $counter = 0;
    # Reading in the loop condition stops cleanly at end of file, so
    # no empty record is counted after the last line.
    while (my $line = <$icd>)
    {
        $popcount++;
        # Lines too short to reach the code section cannot match.
        next if (length($line) <= $offsets[$i]);
        my $codesection = substr($line, $offsets[$i], 140);
        $counter++ if ($codesection =~ /D57/i);
    }
    close $icd;
    my $rate = $popcount ? $counter / $popcount : 0;
    $rate = substr((100000 * $rate), 0, 5);
    print "\n\nRecords listing sickle cell is $counter in $file file";
    print "\nSickle cell rate per 100,000 records is $rate in $file file";
}
exit;

#!/usr/local/bin/ruby
# Count records that list sickle cell disease (code D57) in each CDC
# public use mortality file; report the rate per 100,000 records.
filearray = "mort96us.dat mort99us.dat mort02us.dat mort04us.dat".split
# Zero-based start of the 140-character code section, one per file.
offsets = [448, 161, 162, 164]
filearray.each_with_index do |file, i|
  counter = 0
  popcount = 0
  # The block form of File.open closes the file automatically.
  File.open(file, "r") do |text|
    text.each_line do |line|
      # codesection is nil when the line is shorter than the offset.
      codesection = line[offsets[i], 140]
      popcount += 1
      counter += 1 if codesection =~ /D57/i
    end
  end
  rate = popcount.zero? ? 0.0 : (counter.to_f / popcount) * 100000
  rate = rate.to_s[0, 5]
  puts "\nRecords listing sickle cell is #{counter} in #{file} file"
  puts "Sickle cell rate per 100,000 records is #{rate} in #{file} file"
end

#!/usr/local/bin/python
# Count records that list sickle cell disease (code D57) in each CDC
# public use mortality file; report the rate per 100,000 records.
import re

# Case-insensitive, as in the Perl and Ruby versions.
sickle_match = re.compile('D57', re.IGNORECASE)
lst = ("mort96us.dat", "mort99us.dat", "mort02us.dat", "mort04us.dat")
# Zero-based start of the 140-character code section, one per file.
offsets = (448, 161, 162, 164)
for filename, start in zip(lst, offsets):
    popcount = 0
    counter = 0
    # The with statement closes the file when the block ends.
    with open(filename, "r") as intext:
        for line in intext:
            codesection = line[start:start + 140]
            popcount = popcount + 1
            if sickle_match.search(codesection):
                counter = counter + 1
    rate = 0.0
    if popcount:
        rate = float(counter) / float(popcount) * 100000
    rate = str(rate)[0:5]
    print('\n\nRecords listing sickle cell is ' +
          str(counter) + ' in ' + filename + ' file')
    print('Sickle cell rate per 100,000 records is ' +
          rate + ' in ' + filename + ' file')
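
The start of the code section differs from year to year (448, 161, 162 and 164 in the scripts above), so the position has to be chosen per file. One way to keep that choice in a single place is a lookup table keyed by file name. The sketch below is only an illustration under that assumption. The names `CODE_OFFSETS`, `CODE_WIDTH` and `code_section` are hypothetical, and the test record is synthetic.

```python
# Start offset of the 140-character code section in each year's file,
# copied from the scripts above.
CODE_OFFSETS = {
    "mort96us.dat": 448,
    "mort99us.dat": 161,
    "mort02us.dat": 162,
    "mort04us.dat": 164,
}
CODE_WIDTH = 140

def code_section(filename, line):
    """Return the code section of one record from the named file."""
    start = CODE_OFFSETS[filename]  # raises KeyError for an unknown file
    return line[start:start + CODE_WIDTH]

# Synthetic 600-character record with "D571" placed at the 1999 offset.
record = "x" * 161 + "D571" + "x" * 435
print(code_section("mort99us.dat", record)[:4])
print("D57" in code_section("mort96us.dat", record))
```

With a table like this, a file name that has no known layout fails immediately instead of being parsed with the wrong columns.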

© 2008 Jules Berman

As with all of my scripts, lists, web sites, and blog entries, the following disclaimer applies. This material is provided by its creator, Jules J. Berman, "as is", without warranty of any kind, expressed or implied, including but not limited to the warranties of merchantability, fitness for a particular purpose and noninfringement. In no event shall the author or copyright holder be liable for any claim, damages or other liability, whether in an action of contract, tort or otherwise, arising from, out of or in connection with the material or the use or other dealings.

