Showing posts with label Perl script. Show all posts

Thursday, January 28, 2010

Scripts for fetching and testing web pages

Web pages are files (usually in HTML format) that reside on servers that accept HTTP requests from clients connected to the Internet. Browsers are software applications that send HTTP requests and display the received web pages. Using Perl, Python, or Ruby, you can automate HTTP requests. For each language, the easiest way to make an HTTP request is to use a module that comes bundled as a standard component of the language.

I've written very simple scripts, in Perl, Python, and Ruby, for fetching web files. The scripts, and an explanation of how they work, are available at:

http://www.julesberman.info/factoids/url_get.htm


Perl, Python, and Ruby each provide a standard module for HTTP transactions, and each language's module has its own peculiar syntax. Still, the basic operation is the same: your script initiates an HTTP request for a web file at a specific network address (the URL, or Uniform Resource Locator); a response is received; the web page is retrieved, if possible, and printed to the monitor. Otherwise, the response will contain some information indicating why the page could not be retrieved.

With a little effort, you can use these basic scripts to collect and examine a large number of web pages. With a little more effort, you can write your own spider software that searches for web addresses within web pages, and iteratively collects information from web pages within web pages.
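As a minimal sketch of the fetch-and-extract cycle, written here in Python (the link-matching regular expression, the example page, and the function names are my own invented illustrations, not taken from the scripts described above):

```python
import re
import urllib.request

def fetch_page(url):
    """Request a web file and return its text, or None on failure."""
    try:
        with urllib.request.urlopen(url, timeout=10) as response:
            return response.read().decode("utf-8", errors="replace")
    except Exception as err:
        print("Could not retrieve %s: %s" % (url, err))
        return None

def extract_links(html):
    """Pull absolute link targets out of a page: the seed of a simple spider."""
    return re.findall(r'href="(http[^"]+)"', html)

# Demonstration on a literal page, so no network access is required:
sample = '<a href="http://www.example.org/a">A</a> <a href="http://www.example.org/b">B</a>'
print(extract_links(sample))
```

A spider would call fetch_page on each address returned by extract_links, collecting pages iteratively.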

© 2010 Jules J. Berman

key words: testing link, ruby programming, perl programming, python programming, bioinformatics, valid web page, web page is available, good http request, valid http request, testing if web page exists, testing web links, jules berman, jules j berman, Ph.D., M.D.
Science is not a collection of facts. Science is what facts teach us; what we can learn about our universe, and ourselves, by deductive thinking. From observations of the night sky, made without the aid of telescopes, we can deduce that the universe is expanding, that the universe is not infinitely old, and why black holes exist. Without resorting to experimentation or mathematical analysis, we can deduce that gravity is a curvature in space-time, that the particles that compose light have no mass, that there is a theoretical limit to the number of different elements in the universe, and that the earth is billions of years old. Likewise, simple observations on animals tell us much about the migration of continents, the evolutionary relationships among classes of animals, why the nuclei of cells contain our genetic material, why certain animals are long-lived, why the gestation period of humans is 9 months, and why some diseases are rare and other diseases are common. In “Armchair Science”, the reader is confronted with 129 scientific mysteries, in cosmology, particle physics, chemistry, biology, and medicine. Beginning with simple observations, step-by-step analyses guide the reader toward solutions that are sometimes startling, and always entertaining. “Armchair Science” is written for general readers who are curious about science, and who want to sharpen their deductive skills.

Sunday, December 14, 2008

CDC Mortality Data: 5

This is the fifth in a series of posts on the CDC's (Centers for Disease Control and Prevention) public use mortality data sets.

In the fourth blog of this series, we learned how the cause of death data from death certificate records is transformed into a mortality record consisting of an alphanumeric sequence. The causes of death are represented by ICD codes. If we have a computer-parsable list of ICD codes, we can write a short program that assigns human-readable terms (full names of diseases) to the codes in the mortality files.

Let's start with the each10.txt file, available by anonymous ftp from the ftp.cdc.gov server at:

/pub/Health_Statistics/NCHS/Publications/ICD10/each10.txt

Here are the first few lines of this file:

A00 Cholera
A00.0 Cholera due to Vibrio cholerae 01, biovar cholerae
A00.1 Cholera due to Vibrio cholerae 01, biovar el tor
A00.9 Cholera, unspecified
A01 Typhoid and paratyphoid fevers
A01.0 Typhoid fever
A01.1 Paratyphoid fever A
A01.2 Paratyphoid fever B
A01.3 Paratyphoid fever C
A01.4 Paratyphoid fever, unspecified
A02 Other salmonella infections
A02.0 Salmonella gastroenteritis

There are 9,320 terms in the each10.txt file, sufficient for many purposes. However, the entries in each10.txt were selected from a much larger collection of ICD10 terms. The terms in the each10.txt file are "cause of death" concepts. They will match causes of death found on Death Certificates. However, Death Certificates (and hence the public use CDC mortality data sets) include "other significant conditions" in addition to causes of death (discussed in yesterday's blog). If we want to find the meaning of all of the conditions contained in the CDC mortality files, we need to supplement the each10.txt file with additional ICD10 entries.
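The code/term split can be made with a short regular expression. Here is a hedged sketch in Python rather than Perl; the sample lines are invented, and the spacing in the real each10.txt file may differ:

```python
import re

def parse_icd_line(line):
    """Split an each10.txt-style line into a (code, term) pair."""
    match = re.match(r'^ *([A-Z]\d[\d.]*) +(.+)$', line)
    return (match.group(1), match.group(2).strip()) if match else None

# Invented sample lines for demonstration.
dictionary = {}
for line in ["A00 Cholera", "A00.0 Cholera due to Vibrio cholerae 01, biovar cholerae"]:
    pair = parse_icd_line(line)
    if pair:
        dictionary[pair[0]] = pair[1]
print(dictionary["A00.0"])
```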

More ICD10 entries are found in the i10idx0707.pdf file, a large index of ICD10 terms and their codes. This file is also available by anonymous ftp from the CDC server at:

ftp.cdc.gov
/pub/Health_Statistics/NCHS/Publications/ICD10CM/2007/i10idx0707.zip

Download and unzip this file (a freeware unzip utility is available at http://www.7zip.com/).

The unzipped file is an Adobe Acrobat .pdf file, 1,344 pages in length. (An excerpt from the first page of the .pdf file appeared here as an image in the original post.)



We need to convert this .pdf file into a .txt file if we expect to parse through the file and extract codes and matching terms. For most .pdf files, you can simply select, cut, and paste pages into a .txt file. This method is tricky for very large files, because it requires a lot of memory. Sometimes the textual output from .pdf files is garbled or contains errors, and some .pdf files do not support cut and paste operations at all.

I like to convert .pdf files to .txt files in a two-step operation using free, open source command-line pdf utilities: pdftk and xpdf.

pdftk is available at: http://www.accesspdf.com/pdftk/
The zipped .exe file is pdftk-1.12.exe.zip

xpdf is available at: http://www.foolabs.com/xpdf/download.html
The zipped .exe file is xpdf-3.02pl2-win32.zip

Many .pdf files come in an internally compressed format. The compression is not apparent to the user. There is no special file extension, and the Adobe Reader software decompresses the file seamlessly. Before converting compressed .pdf files to text, we need to decompress the file.

From the command line in the pdftk subdirectory, uncompress the compressed .pdf file with the following command. (Remember to copy the i10idx0707.pdf file into the pdftk subdirectory first.)

C:\pdftk>pdftk i10idx0707.pdf output mydoc_un.pdf uncompress

Now we have an uncompressed .pdf file, mydoc_un.pdf.

From the xpdf subdirectory, we can convert the uncompressed .pdf file to a .txt file. (Remember to copy the mydoc_un.pdf file into the xpdf subdirectory first.)

C:\xpdf>pdftotext mydoc_un.pdf mydoc.txt

This produces a text (ASCII) version of the icd10 index file:

MYDOC.TXT (2,360,138 bytes)

At last, we have two .txt files (mydoc.txt and each10.txt) that we can use, together, to create a clean, computer-parsable list of ICD codes and their equivalent terms, in English.

Here is the Perl script that does the job. You will need to place the mydoc.txt file and the each10.txt file in the same subdirectory as the Perl script.

#!/usr/bin/perl
#Build a dictionary of ICD10 codes and terms from mydoc.txt
#and each10.txt, and write the result to mydoc.out.
open(TEXT, "mydoc.txt")||die"cannot";
undef($/);                    #slurp mode: read the whole file at once
$var = <TEXT>;
close TEXT;
$var =~ tr/\14//d;            #delete form-feed characters left by pdftotext
$var =~ s/ ([A-Z][0-9][0-9\.]*[0-9])/ $1\n/g;   #break the text after each ICD code
open(TEXT, ">mydoc.out")||die"cannot";
print TEXT $var;
close TEXT;
$/ = "\n";                    #restore line-by-line reading
open(TEXT, "mydoc.out")||die"cannot";
$line = " ";
while ($line ne "")
{
$line = <TEXT>;
if ($line =~ /\b([A-Z][0-9][0-9\.]*[0-9]) *$/)
{
$term = $`;                   #everything before the code is the term
$code = $1;
$term =~ s/^[ \-]*//o;        #trim leading spaces and hyphens
$term =~ s/[ \-]*$//o;        #trim trailing spaces and hyphens
$term = lc($term);
$line =~ tr/\173//d;
$dictionary{$code} = $term;
}
}
close TEXT;
open (ICD, "each10.txt")||die"cannot";
undef($/);
$line = <ICD>;
$line =~ tr/\000-\011//d;     #strip control characters and
$line =~ tr/\013-\014//d;     #punctuation that would interfere
$line =~ tr/\016-\037//d;     #with code matching
$line =~ tr/\041-\055//d;
$line =~ tr/\173-\377//d;
@linearray = split(/\n(?=[ ]*[A-Z][0-9\.]{1,5})/, $line);   #one array element per entry
foreach $thing (@linearray)
{
if ($thing =~ /^ *([A-Z][0-9\.]{1,5}) ?/)
{
$code = $1;
$term = $';                   #everything after the code is the term
$term =~ s/\n//;
$term =~ s/[ ]+$//;
$dictionary{$code} = $term;   #each10.txt terms overwrite mydoc.txt terms
}
}
unlink("mydoc.out");
open (TEXT, ">mydoc.out")||die"cannot";
foreach $key (sort keys %dictionary)
{
print TEXT "$key $dictionary{$key}\n";
}
close TEXT;
exit;

The output file is mydoc.out (1,091,342 bytes). It contains about 23,000 code/term pairs.

I renamed the mydoc.out file to icd10_pl.txt. We will use this file in the next Perl script to determine the total number of occurrences of each condition appearing in the 1+ Gbyte mort99us.dat file. You will need to place the icd10_pl.txt file and the Mort99us.dat file in the same subdirectory as this Perl script.

#!/usr/local/bin/perl
#Tally the number of times each ICD10 condition appears
#in the CDC mortality file, Mort99us.dat.
open (ICD, "icd10_pl.txt")||die"cannot";
$line = " ";
while ($line ne "")
{
$line = <ICD>;
if ($line =~ /^([A-Z][0-9\.]+) +/)
{
$code = $1;
$term = $';
$code =~ s/\.//o;             #mortality records omit the period in ICD codes
$term =~ s/\n//;
$term =~ s/ +$//;
$dictionary{$code} = $term;
}
}
close ICD;
open (ICD, "Mort99us.dat")||die"cannot";
$line = " ";
print "\n\n";
while ($line ne "")
{
$line = <ICD>;
$codesection = substr($line,161,140);   #the record field listing the conditions
@codearray = split(/ +/,$codesection);
foreach $code (@codearray)
{
$code =~ /[A-Z][0-9]+/;
$code = $&;
$counter{$code}++;
}
}
close ICD;
open (OUT, ">cdc.out")||die"cannot";
while ((my $key, my $value) = each(%counter))
{
$value = "000000" . $value;   #left-pad the count to six digits so that
$value = substr($value,-6,6); #a reverse string sort ranks by frequency
push(@filearray, "$value $key $dictionary{$key}");
}
$outfile = join("\n", reverse(sort(@filearray)));
print OUT $outfile;
exit;

On my 2.5 GHz desktop computer, it takes well under a minute to parse through the 1+ Gbyte CDC mortality data set and produce the desired output file (cdc.out). The total number of records parsed by the script was 2,394,871. There are 5,650 conditions included in the 1999 CDC mortality data set.
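The zero-padding step in the script above (prepending "000000" and keeping the last six characters) lets an ordinary reverse string sort rank the output lines by frequency. Here is a hedged sketch of the same idea in Python, with a few invented counts for demonstration:

```python
# Hypothetical condition counts, for demonstration only.
counter = {"I251": 412827, "I469": 352559, "I10": 203906}

filearray = []
for code, value in counter.items():
    padded = ("000000" + str(value))[-6:]  # left-pad the count to six digits
    filearray.append("%s %s" % (padded, code))

# A plain reverse string sort now ranks the lines from most to least frequent.
for line in sorted(filearray, reverse=True):
    print(line)
```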

The first 45 lines of the output file are:

412827 I251 Atherosclerotic heart disease
352559 I469 Cardiac arrest unspecified
273644 I500 Congestive heart failure
244162 I219 Acute myocardial infarction unspecified
210394 J449 Chronic obstructive pulmonary disease unspecified
206996 J189 Pneumonia unspecified
203906 I10 Essential primary hypertension
176834 I64 Stroke not specified as hemorrhage or infarction
162128 C349 Bronchus or lung unspecified
149777 E149 Without complications
143326 J969 Respiratory failure unspecified
129947 A419 Septicemia unspecified
115504 F03 Unspecified dementia
106137 N19 Unspecified renal failure
101224 I250 Atherosclerotic cardiovascular disease so described
075834 G309 Alzheimers disease unspecified
073365 I709 Generalized and unspecified atherosclerosis
072931 R092 Respiratory arrest
067525 I499 Cardiac arrhythmia unspecified
067055 I48 Atrial fibrillation and flutter
065657 C80 Malignant neoplasm without specification of site
056718 J690 Pneumonitis due to food and vomit
056425 C189 Colon unspecified
051698 C509 Breast unspecified
045707 C61 Malignant neoplasm of prostate
043191 I429 Cardiomyopathy unspecified
038127 N189 Chronic renal failure unspecified
038072 E119 Without complications
037810 I119 Hypertensive heart disease without congestive heart failure
036801 I739 Peripheral vascular disease unspecified
036151 D649 Anemia unspecified
035720 N390 Urinary tract infection site not specified
035552 K922 Gastrointestinal hemorrhage unspecified
035133 J439 Emphysema unspecified
031981 K746 Other and unspecified cirrhosis of liver
031327 E86 Volume depletion
031266 G20 Parkinsons disease
031050 N180 Endstage renal disease
030693 N179 Acute renal failure unspecified
030442 R99 Other illdefined and unspecified causes of mortality
030416 C259 Pancreas unspecified
030248 K729 Hepatic failure unspecified
028458 I255 Ischemic cardiomyopathy
026860 I269 Pulmonary embolism without mention of acute cor pulmonale
026445 I509 Heart failure unspecified

The top line is:

412827 I251 Atherosclerotic heart disease

It indicates that atherosclerotic heart disease is the most common condition listed in the death certificates in 1999 in the U.S. It was listed 412,827 times. The ICD10 code for Atherosclerotic heart disease is I25.1.

Some of the output lines do not seem particularly helpful. For example:

056425 C189 Colon unspecified
051698 C509 Breast unspecified

Nobody dies from "Colon unspecified." The strange diagnosis is explained by the rather unsatisfactory way that the ICD assigns terms to codes. In this case, "Colon unspecified" is a sub-term under the general category of malignant neoplasms of the colon. We know this because all of the codes beginning with "C" (i.e., C189 and C509 in this case) are cancer codes. Whenever an ICD term appears uninformative, we can return to the icd10_pl.txt file (created earlier in this post) and clarify its meaning by examining the root term for the sub-term.
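A root-term lookup can be made by truncating the sub-term code and consulting the same dictionary. A hedged sketch in Python, with a few invented dictionary entries (the real ICD10 category titles may be worded differently):

```python
# Hypothetical excerpt of an icd10_pl.txt-style dictionary.
dictionary = {
    "C18": "malignant neoplasm of colon",
    "C189": "colon unspecified",
    "C50": "malignant neoplasm of breast",
    "C509": "breast unspecified",
}

def root_term(code, dictionary):
    """Drop the final character of a sub-term code to find its category term."""
    root = code[:-1]
    return dictionary.get(root, "root term not found")

print(root_term("C189", dictionary))
```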

In the next blog in this series, we will try a more ambitious project, using the CDC mortality data and the U.S. map in a Ruby mashup script.

As I remind readers in almost every blog post, if you want to do your own creative data mining, you will need to learn a little computer programming.

For Perl and Ruby programmers, methods and scripts for using a wide range of publicly available biomedical databases are described in detail in my prior books:

Perl Programming for Medicine and Biology

Ruby Programming for Medicine and Biology

An overview of the many uses of biomedical information is available in my book,
Biomedical Informatics.

More information on cancer is available in my recently published book, Neoplasms.

© 2008 Jules Berman

As with all of my scripts, lists, web sites, and blog entries, the following disclaimer applies. This material is provided by its creator, Jules J. Berman, "as is", without warranty of any kind, expressed or implied, including but not limited to the warranties of merchantability, fitness for a particular purpose and noninfringement. in no event shall the author or copyright holder be liable for any claim, damages or other liability, whether in an action of contract, tort or otherwise, arising from, out of or in connection with the material or the use or other dealings in the material.


My book, Principles of Big Data: Preparing, Sharing, and Analyzing Complex Information was published in 2013 by Morgan Kaufmann.



I urge you to explore my book. Google books has prepared a generous preview of the book contents.

tags: big data, metadata, data preparation, data analytics, data repurposing, datamining, data mining, neoplasms, nomeclature, Perl script

Tuesday, February 12, 2008

Medical autocoding with Perl

In yesterday's blog, I showed a short, simple Ruby script that can provide quick and accurate medical autocoding for medical free-text. I also provided a web site where you could inspect 20,000 PubMed abstract titles and the extracted/coded terms produced by the Ruby autocoder.

Today, I'm providing a web site with the equivalent Perl medical autocoder, along with the public domain output file of 20,000 autocoded PubMed abstracts. Surprisingly (to me), the Perl code executed at about the same speed as the Ruby code. Both autocoders would have significant speed gains if they used the doublet method (which I didn't use here because I wanted to demonstrate the shortest possible scripts). The Perl code is contained on the web page.

- Jules Berman

Friday, January 18, 2008

Corrections to Perl scripts

I am very grateful to Dr. Robert McDowell for uncovering a presentation problem in all of the Perl scripts that I've posted in prior blogs.

When the blog software encounters a Perl get-file expression (<FILENAME>), it misinterprets the Perl expression as an HTML tag and suppresses its visibility.

Consequently, all of the Perl scripts that called for a line of a file (i.e., most of my posted Perl files) had a missing filehandle. I've gone back through all of the old posts and have substituted escaped HTML bracket characters (<, >) where appropriate, and I think this has fixed the problem.
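The fix amounts to replacing the angle brackets with their HTML character entities before publishing. As an illustration in Python (any language with an HTML-escaping routine works the same way; the Perl line shown is just an example):

```python
import html

perl_line = '$line = <TEXT>;'
escaped = html.escape(perl_line)  # replaces < and > with &lt; and &gt;
print(escaped)  # the browser now displays the filehandle instead of hiding it
```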

I apologize for any inconvenience this may have caused.

- Jules Berman

Tuesday, January 15, 2008

Perl implementation of doublet deidentifier

Here is the Perl code for implementing the doublet deidentifier (medical record scrubber).

It operates on a collection of over 15,000 PubMed citations (author line and title line), and uses a publicly available external list of "safe" doublets. A plain-text file of doublets is available.

The entire output of the script is available for review.

As with all my distributed scripts, the following disclaimer applies:

The perl script for deidentifying text using the doublet method is provided by its creator, Jules J. Berman, "as is", without warranty of any kind, expressed or implied, including but not limited to the warranties of merchantability, fitness for a particular purpose and noninfringement. in no event shall the author or copyright holder be liable for any claim, damages or other liability, whether in an action of contract, tort or otherwise, arising from, out of or in connection with the software or the use or other dealings in the software.


#!/usr/local/bin/perl
#Doublet deidentifier: prints each line of pathol5.txt twice, once as
#the original and once scrubbed, replacing every word that is not part
#of an approved doublet with an asterisk.
$begin = time();
open(TEXT,"doublets.txt")||die"cannot";
$line = " ";
while ($line ne "")
{
$line = $getline = <TEXT>;
$getline =~ s/\n//;
$doublethash{$getline}= "";
}
$end = time();
$totaltime = $end - $begin;
print STDERR "Time to create ";
print STDERR "the doublet hash is ";
print STDERR "$totaltime seconds.\n\n";
close TEXT;
$begin = time();
open(TEXT,"pathol5.txt")||die"cannot";
open(STDOUT,">pathol5.out")||die"cannot";
$line = " "; $oldthing = ""; $state = 0;
while ($line ne "")
{
$line = <TEXT>;
next if ($line eq "\n");
print "Original - $line" . "Scrubbed - " ;
$line =~ s/[\,\.\n]//g;
$line = lc($line);
my @linearray = split(/ /,$line);
push (@linearray, "lastword");
foreach $thing (@linearray)
{
if ($oldthing eq "")
{
$oldthing = $thing;
next;
}
$term = "$oldthing $thing";
if (exists($doublethash{$term}))
{
print "$oldthing ";
$oldthing = $thing;
$state = 1;
next;
}
if ($state == 1)
{
if ($thing eq "lastword")
{
print $oldthing;
print "\.\n";
$oldthing = "";
$state = 0;
next;
}
print "$oldthing ";
$oldthing = $thing;
$state = 0;
next;
}
if ($state == 0)
{
if ($thing eq "lastword")
{
print "\*\.\n";
$oldthing = "";
next;
}
print "\* ";
$oldthing = $thing;
next;
}
}
}
$end = time();
$totaltime = $end - $begin;
print STDERR "Time following ";
print STDERR "doublet hash creation";
print STDERR " is $totaltime seconds.";
exit;


-Jules Berman

Jules J. Berman, Ph.D., M.D.
tags: big data, metadata, data preparation, data analytics, data repurposing, datamining, data mining, de-identification, deidentification, doublet method, medical scrubber, Perl script

Monday, January 14, 2008

Parsable Doublets List now available in public domain

Word doublets are two-word phrases that appear in text (i.e., they are not randomly chosen two-word sequences).

Doublets can be used in a variety of informatics projects: indexing, data scrubbing, nomenclature curation, etc. Over the next few days, I will provide examples of doublet-based informatics projects.

A list of over 200,000 word doublets is available for download.

The list was generated from a large narrative pathology text. Thus, the doublets included here would be particularly suitable for informatics projects involving surgical pathology reports, autopsy reports, pathology papers and books, and so on.

The Perl script that generated the list of doublets by parsing through a text file ("pathold.txt") is shown:


#!/usr/local/bin/perl
#Collect the unique word doublets occurring in pathold.txt.
open(TEXT,"pathold.txt")||die"cannot";
open(OUT,">doublets.txt")||die"cannot";
undef($/);                    #slurp the entire file into one string
$var = <TEXT>;
$var =~ s/\n/ /g;             #replace newlines with spaces
$var =~ s/\'s//g;             #drop possessives
$var =~ tr/a-zA-Z\'\- //cd;   #keep only letters, apostrophes, hyphens, spaces
@words = split(/ +/, $var);
foreach $thing (@words)
{
$doublet = "$oldthing $thing";
if ($doublet =~ /^[a-z]+ [a-z]+$/)
{
$doublethash{$doublet}="";
}
$oldthing = $thing;
}
close TEXT;
@wordarray = sort(keys(%doublethash));
print OUT join("\n",@wordarray);
close OUT;
exit;


You can generate your own list by substituting any text file you like for "pathold.txt". Keep in mind that the Perl script slurps the entire text file into a string variable, so the script won't work if you use a file that exceeds the memory of your computer. For most computers (with more than 256 MBytes of RAM), this will not be a problem. On my computer (about 2.8 GHz, with 512 MBytes of RAM), the script takes about 5 seconds to parse a 9 Megabyte text file.
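For files too large to slurp, the same doublets can be collected line by line, carrying the last word of each line over into the next. A hedged Python sketch of this streaming variant (it reads from a list of lines here, in place of a file object, and the sample text is invented):

```python
import re

def collect_doublets(lines):
    """Collect unique lowercase word doublets, streaming one line at a time."""
    doublets = set()
    oldword = ""
    for line in lines:
        line = re.sub(r"'s", "", line)               # drop possessives
        line = re.sub(r"[^a-zA-Z'\- ]", " ", line)   # keep only word characters
        for word in line.split():
            doublet = "%s %s" % (oldword, word)
            if re.match(r"^[a-z]+ [a-z]+$", doublet):
                doublets.add(doublet)
            oldword = word                            # carry over to the next pair
    return sorted(doublets)

text = ["the specimen was received", "in formalin and the specimen"]
print(collect_doublets(text))
```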

Since the doublet list consists of a non-narrative collection of words, it cannot be copyrighted (i.e., it is distributed as a public domain file).

-Jules Berman

tags: big data, metadata, data preparation, data analytics, data repurposing, datamining, data mining, biomedical informatics, curation, data scrubbing, deidentification, medical nomenclature, Perl script, public domain, doublets list

Sunday, July 1, 2007

Neoplasm classification structural validation

In yesterday's post, I announced the newest version of the Developmental Lineage Classification and Taxonomy of Neoplasms (also called the Neoplasm Classification).

When you have a nomenclature that contains hundreds of thousands of terms, and when new versions of the nomenclature are regularly released, you need computational methods to check the internal consistency of the nomenclature. The classification is in XML, and this makes it easy to write a multi-purpose parsing script.

The Perl script (below) has three purposes:

1. It checks that neocl.xml is well-formed xml

2. It checks that a concept identifying code in one class is not repeated in any
other class within neocl.xml

3. It checks that no term in neocl.xml is ever repeated

On my 2.5 GHz computer, the xmlvocab.pl Perl script takes about 4 seconds to parse and check the 10+ Megabyte neocl.xml file. The script provides messages indicating any problem terms in the nomenclature.

#!/usr/bin/perl
#xmlvocab.pl
#
#This Perl script was created by Jules J. Berman and
#updated on 5/19/2005
#
#Copyright (c) 2005 Jules J. Berman
#
#Permission is granted to copy, distribute and/or
#modify this document
#under the terms of the GNU Free Documentation
#License, Version 1.2
#or any later version published by the Free
#Software Foundation;
#with no Invariant Sections, no Front-Cover Texts,
#and no Back-Cover Texts.
#
#The software is provided "as is", without warranty
#of any kind, express or implied, including but not
#limited to the warranties of merchantability,
#fitness for a particular purpose and
#noninfringement. in no event shall the authors
#or copyright holders be liable for any claim, damages
#or other liability, whether in an action of contract,
#tort or otherwise, arising from, out of or in connection
#with the software or the use or other dealings in the
#software.
#
#An explanation of the classification can be found in
#the following two publications, which should be cited
#in any publication or work that may result from any
#use of this file.
#
#Berman JJ. Tumor classification: molecular analysis
#meets Aristotle. BMC Cancer 4:8, 2004.
#
#neocl.xml is the classification of all neoplastic lesions.
#
use XML::Parser;
my $parser = XML::Parser->new( Handlers => {
Init => \&handle_doc_start,
Final => \&handle_doc_end,
});
$file = "neocl.xml";
$parser -> parsefile($file);

sub handle_doc_start
{
print "\nBeginning to parse $file now\n";
}

sub handle_doc_end
{
print "\nFinished. $file is a well-formed XML File.\n";
}

open (TEXT, $file);
#open (OUT,">neocl.out");
my $countcode = 0;
my $line = " ";
my %code;
my $classname;
my $phrasecount = 0;
while ($line ne "")
{
$line = <TEXT>;
next unless ($line =~ /\<.+\>/);
if ($line =~ /^\<([a-z\_]+)\>/)
{
$classname = $1;
next;
}
if ($line =~ /[CS]([0-9]{7})/)
{
$phrasecount++;
if (exists $code{$&})
{
if ($code{$&} ne $classname)
{
print "$& is a problem\n";
}
}
else
{
$code{$&} = $classname;
$countcode++;
}
}
}
close TEXT;
print "The total number of concepts is $countcode\n";
print "The total number of phrases is $phrasecount\n";
open (TEXT, $file);
undef %code;
$line = " ";
my %item;
while ($line ne "")
{
$line = <TEXT>;
if ($line =~ /([CS])([0-9]{7})/)
{
$prefix = $1;
$line =~ /\"\> ?(.+) ?\<\//;
$phrase = $prefix . $1;
if (exists $item{$phrase})
{
print $. . " More than one occurrence of \"$phrase\"\n";
}
$item{$phrase}="";
}
}
close TEXT;
exit;

-Jules J. Berman

In June, 2014, my book, entitled Rare Diseases and Orphan Drugs: Keys to Understanding and Treating the Common Diseases was published by Elsevier. The book builds the argument that our best chance of curing the common diseases will come from studying and curing the rare diseases.



I urge you to read more about my book. There's a generous preview of the book at the Google Books site. If you like the book, please request your librarian to purchase a copy of this book for your library or reading room.

tags: common disease, orphan disease, orphan drugs, rare disease, subsets of disease, disease genetics, genetics of complex disease, genetics of common diseases, cryptic disease