Specified Life: April 2008

Friday, April 18, 2008

MeSH (Medical Subject Headings) more complex than simple "Trees"

MeSH (Medical Subject Headings) is a wonderful nomenclature of medical terms available from the U.S. National Library of Medicine.

The download site is:

http://www.nlm.nih.gov/mesh/filelist.html

MeSH is one of the greatest gifts provided by the U.S. National Library of Medicine and can be used freely for a variety of projects involving indexing, tagging, searching, retrieving, coding, analyzing, merging, and sharing biomedical text. In my opinion, there are many projects that rely on commercial and legally encumbered nomenclatures that would be better served by MeSH.

My only quibble with MeSH is that it is incorrectly described as a Tree structure.

Here is the official word (from the NLM website) on MeSH Trees from: http://www.nlm.nih.gov/mesh/intro_trees2007.html

"Because of the branching structure of the hierarchies, these lists are sometimes referred to as "trees". Each MeSH descriptor appears in at least one place in the trees, and may appear in as many additional places as may be appropriate. Those who index articles or catalog books are instructed to find and use the most specific MeSH descriptor that is available to represent each indexable concept."

When you look at individual entries in MeSH, you find that a single entry may be assigned multiple MeSH numbers.

For example, the MeSH term, "Family" is assigned two MeSH numbers,

MN = F01.829.263
MN = I01.880.225
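
As a rough illustration, the following Python 3 sketch pulls the term name and its MeSH numbers out of one record. The record text is a trimmed, hand-typed stand-in that assumes the `MH = ` and `MN = ` line layout of the ASCII MeSH file, and the function name `parse_record` is a hypothetical choice, not part of any NLM tool.

```python
import re

# A trimmed stand-in for one record; real records carry many more fields.
RECORD = """*NEWRECORD
MH = Family
MN = F01.829.263
MN = I01.880.225
"""

def parse_record(text):
    """Return (term name, list of MeSH numbers) for one record."""
    name = re.search(r"^MH = (.+)$", text, re.MULTILINE)
    numbers = re.findall(r"^MN = (\S+)", text, re.MULTILINE)
    return (name.group(1).strip() if name else None), numbers

print(parse_record(RECORD))
```

A term with more than one `MN` line comes back with more than one number, which is the situation the rest of this post is concerned with.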

The parent "number" for any MeSH entry is found by removing the last period-delimited group of digits.

For example:
F01.829.263 MeSH name, Family
F01.829 MeSH name, Psychology, Social
F01 MeSH name, Behavior and Behavior Mechanisms
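
The trimming rule can be sketched in a few lines of Python 3. The function names `parent_number` and `ancestry` are illustrative choices, and the sketch assumes that a number with no period (such as F01) is a top-level entry with no parent.

```python
def parent_number(mesh_number):
    """Drop the last period-delimited group; top-level numbers have no parent."""
    head, sep, _tail = mesh_number.rpartition(".")
    return head if sep else None

def ancestry(mesh_number):
    """List a MeSH number followed by each of its ancestors."""
    chain = []
    while mesh_number is not None:
        chain.append(mesh_number)
        mesh_number = parent_number(mesh_number)
    return chain

print(ancestry("F01.829.263"))  # ['F01.829.263', 'F01.829', 'F01']
```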

For each MeSH number, there is a separate hierarchy.

It is tempting to think of each hierarchy for each number as a tree (then MeSH could be envisioned as a dense forest), but each parent term could be assigned multiple MeSH numbers, each producing a multi-branching hierarchy.

Because each MeSH term (including the ancestral terms for a MeSH term) may be assigned multiple MeSH numbers, each with its own hierarchy, the MeSH data structure is more accurately thought of as a complex ontology, with terms existing in multiple classes, with specified relationships among any class and its parent classes.

The tree metaphor breaks down because branches and nodes within a branch can be connected to other branches and to other nodes. Trees do not do this kind of thing.

It is possible to write a script that parses through every MeSH entry, finds all of the MeSH numbers for the entry, determines the parent terms for the MeSH numbers, determines all of the alternate MeSH numbers for the parent terms, then finds all of the grandparent terms for all of the parent terms, etc., until all of the hierarchical terms for the term are found.
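
Before the full script, here is a minimal Python 3 sketch of that procedure, run on a small in-memory table rather than the MeSH file. The table is a hand-picked slice of the "Giant Cells, Foreign-Body" output shown later in this post, with most numbers and terms left out; the function names are hypothetical, and the worklist approach is a substitute for the fixed number of passes used in the Perl script.

```python
# Hand-picked slice of the sample output in this post; deliberately incomplete.
NAME_TO_NUMBERS = {
    "Giant Cells, Foreign-Body": ["A11.329.372.376", "A11.502.376"],
    "Macrophages": ["A11.329.372", "A11.733.397"],
    "Giant Cells": ["A11.502"],
    "Connective Tissue Cells": ["A11.329"],
    "Phagocytes": ["A11.733"],
    "Cells": ["A11"],
}

def ancestor_numbers(number):
    """Yield each ancestor of a MeSH number, nearest first."""
    while "." in number:
        number = number.rsplit(".", 1)[0]
        yield number

def lineage_numbers(term, name_to_numbers):
    """Collect every MeSH number reachable from a term."""
    number_to_name = {
        number: name
        for name, numbers in name_to_numbers.items()
        for number in numbers
    }
    seen = set()
    todo = list(name_to_numbers[term])
    while todo:
        number = todo.pop()
        if number in seen:
            continue
        seen.add(number)
        # Parents, grandparents, and so on, found by trimming the number.
        todo.extend(ancestor_numbers(number))
        # Alternate numbers assigned to the same term.
        name = number_to_name.get(number)
        if name is not None:
            todo.extend(name_to_numbers[name])
    return seen

found = lineage_numbers("Giant Cells, Foreign-Body", NAME_TO_NUMBERS)
for number in sorted(found):
    print(number)
```

In this slice, A11.733 (Phagocytes) is reached only because Macrophages carries the alternate number A11.733.397, which is why the alternate numbers of every ancestor have to be followed as well.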

Here is the Perl script. This Perl script is provided "as is", by its creator, Jules J. Berman, without warranty of any kind, express or implied, including but not limited to the warranties of merchantability, fitness for a particular purpose and noninfringement. In no event shall the authors or copyright holders be liable for any claim, damages or other liability, whether in an action of contract, tort or otherwise, arising from, out of or in connection with the software or the use or other dealings in the software.


#!/usr/local/bin/perl
open(MESH, "D2007.BIN"); #the file name for the raw ascii MeSH version
open(OUT, ">mesh.out");
$/ = "\n\n\*NEWRECORD\n";
$line = " ";
@cumlist;
%numberhash;
%namehash;
while ($line ne "")
  {
  my $numbers = "";
  $line = <MESH>;
  $line =~ /\nMH = ([^\n]+)\n/;
  $name = $1;
  while ($line =~ m/\nMN ?= ?([^\n]+)(?=\n)/mg)
     {
     $number = $1;
     $number =~ s/^ *//o;
     $number =~ s/ *$//o;
     $number =~ s/ +/ /;
     $numberhash{$number} = $name;
     $numbers = $numbers . " " . $number;
     }
  $numbers =~ s/^ *//o;
  $numbers =~ s/ *$//o;
  $numbers =~ s/ +/ /o;
  $namehash{$name} = $numbers;
  }
close(MESH);
while((my $key, my $value) = each (%namehash))
   {
   @cumlist = ("");
   print OUT "\nTERM LINEAGE FOR " . uc($key) . "\n";
   my @valuelist = split(/ /,$value);
   @cumlist = (@cumlist, @valuelist);
   &splitlist(@cumlist);
   for(1..30)
     {
     @cumlist = grep { $marked{$_}++; $marked{$_} == 1; } 
             @cumlist;
     undef(%marked);
     &allmeshnums(@cumlist);
     @cumlist = grep { $marked{$_}++; $marked{$_} == 1; } 
             @cumlist;
     undef(%marked);
     &splitlist(@cumlist);
     }
   @cumlist = grep { $marked{$_}++; $marked{$_} == 1; } 
             @cumlist;
   undef(%marked);
   foreach my $thing (@cumlist)
      {
      print OUT "$thing $numberhash{$thing}\n";
      }
   }
sub splitlist()
   {
   @valuelist = @_;
   foreach my $meshno (@valuelist)
     {
     for(1..30)
       {                      
       if ($meshno =~ /\.[0-9]+$/)
          {
          $meshno = $`;
          push(@cumlist, $meshno);
          }
       else
          {
          last;
          }
       }
     }
    }
sub allmeshnums()
  {
  @meshnumber = @_;
  foreach my $thing (@meshnumber)
    {
    my $name = $numberhash{$thing};
    my $value = $namehash{$name};
    my @valuelist = split(/ /,$value);
    @cumlist = (@valuelist, @cumlist);
    }
  }
exit;

The output file, mesh.out, is over 9 megabytes in length.

Here is an example of one entry in the output file, mesh.out:


TERM LINEAGE FOR GIANT CELLS, FOREIGN-BODY
A11.118.637 Leukocytes
A15.145.229.637 Leukocytes
A15.382.490 Leukocytes
A11.118.637.555 Leukocytes, Mononuclear
A15.145.229.637.555 Leukocytes, Mononuclear
A15.382.490.555 Leukocytes, Mononuclear
A15.378 Hematopoietic System
A11.148 Bone Marrow Cells
A15.378.316 Bone Marrow Cells
A12.207.152 Blood
A15.145 Blood
A11.118 Blood Cells
A15.145.229 Blood Cells
A11.329.372.376 Giant Cells, Foreign-Body
A11.502.376 Giant Cells, Foreign-Body
A11.627.624.480.376 Giant Cells, Foreign-Body
A11.733.397.376 Giant Cells, Foreign-Body
A15.382.680.397.376 Giant Cells, Foreign-Body
A15.382.812.522.376 Giant Cells, Foreign-Body
A11.329 Connective Tissue Cells
A11 Cells
A11.502 Giant Cells
A11.118.637.555.652 Monocytes
A11.148.580 Monocytes
A11.627.624 Monocytes
A11.733.547 Monocytes
A15.145.229.637.555.652 Monocytes
A15.378.316.580 Monocytes
A15.382.490.555.652 Monocytes
A15.382.680.547 Monocytes
A15.382.812.547 Monocytes
A11.627 Myeloid Cells
A11.733 Phagocytes
A15.382.680 Phagocytes
A15.382 Immune System
A15 Hemic and Immune Systems
A11.329.372 Macrophages
A11.627.624.480 Macrophages
A11.733.397 Macrophages
A15.382.680.397 Macrophages
A15.382.812.522 Macrophages
A15.382.812 Reticuloendothelial System
A12.207 Body Fluids
A12 Fluids and Secretions

When we examine the multi-lineage ancestry of "Foreign body giant cells" we see that MeSH is not a tree hierarchy. This means that the MeSH data structure is highly complex and requires some computational know-how to fully explore all the term relationships.
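
The same point can be checked mechanically. The Python 3 sketch below takes the six numbers listed above for this term, trims each to its parent number, and looks up the parent's name in a table copied from the same output. The helper name `parent_terms` is an illustrative choice.

```python
# Parent numbers and names copied from the sample output above.
NUMBER_TO_NAME = {
    "A11.329.372": "Macrophages",
    "A11.627.624.480": "Macrophages",
    "A11.733.397": "Macrophages",
    "A15.382.680.397": "Macrophages",
    "A15.382.812.522": "Macrophages",
    "A11.502": "Giant Cells",
}

# The six numbers assigned to "Giant Cells, Foreign-Body" in the output.
TERM_NUMBERS = [
    "A11.329.372.376",
    "A11.502.376",
    "A11.627.624.480.376",
    "A11.733.397.376",
    "A15.382.680.397.376",
    "A15.382.812.522.376",
]

def parent_terms(numbers, number_to_name):
    """Return the set of distinct parent term names for a list of numbers."""
    parents = set()
    for number in numbers:
        parent = number.rpartition(".")[0]
        if parent in number_to_name:
            parents.add(number_to_name[parent])
    return parents

parents = parent_terms(TERM_NUMBERS, NUMBER_TO_NAME)
print(sorted(parents))  # ['Giant Cells', 'Macrophages']
```

In a tree, every node other than the root has exactly one parent. Here a single term has two distinct parent terms, so the structure cannot be a tree.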

Jules Berman

tags: Perl programming for medicine and biology, nomenclature, thesaurus, nlm, medical subject headings, open source, medical indexing, medical data retrieval, medical informatics, biomedical informatics, national library of medicine

Thursday, April 17, 2008

Medical Informatics has barely advanced in 50 years

Medical informaticians today are working on the same problems that they were working on in the early 1960s. I recently did a PubMed search on "computers AND medical records" and pulled the citations from the earliest dates captured by PubMed.

Here are some of the titles.

1965 Apr
AUTOMATING YOUR OWN MEDICAL DATA PROCESSING CAN BE COSTLY.

1965 Apr
AKRON SPEEDS INFORMATION SYSTEM SLOWLY.

1965 Apr
COMPUTER ALLOWS A ROUTINE ECG FOR EVERY ADMISSION.

1965 Mar 15
APPLICATION OF COMPUTERS IN CLINICAL PRACTICE.

1965 Mar
MECHANIZING A LARGE REGISTER OF FIRST ORDER PATIENT DATA.

1965 Feb
COMPUTER PROCESSING OF NEURORADIOLOGICAL REPORTS. AN INTRODUCTION TO THE APPLICATION OF THE VARIABLE-FIELD-LENGTH FORMAT AND MEDTRAN.

1965 Feb
[Storage and retrieval of medical records in 1964]

1964 Dec 21
COMPUTER HANDLING OF AMBULATORY CLINIC RECORDS.

1964 Nov
STORAGE AND RETRIEVAL OF CLINICAL AND LABORATORY DATA.

1964 Oct 16
AUTOMATION IN MEDICAL RECORDS: A LOOK AHEAD.

1964 Oct
COMPUTER GENERATED HOSPITAL DIAGNOSIS FILE.

1964 Jul 4
THE ROLE OF THE COMPUTER IN REFINING DIAGNOSIS.

1964 Jun 15
MANIPULATION OF AUTOPSY DIAGNOSES BY COMPUTER TECHNIQUE.

1964 Jun
REQUIREMENTS AND APPLICATIONS OF AUTOMATION IN HOSPITAL FUNCTIONS.

1964 Apr 9
[ELECTRONIC DATA PROCESSING IN CLINICAL ANESTHETIC PRACTICE.]

1964 Apr
DESIGN OF A COMPUTER SUPPORTED CLINICAL STUDY.

1963 Dec
OBSTETRICAL DATA PROCESSING: THE COMPUTER AS AN OBSTETRIC DATA RETRIEVAL DEVICE.

1963 Nov
A COMPUTER SYSTEM FOR CLASSIFYING CARDIOPULMONARY DISABILITY.

1963 Sep
A CENTRAL ELECTRONIC COMPUTER SPEEDS PATIENT INFORMATION.

1962 May
Patient data: a computer-based system.

In the 60s, our predecessors worked on the same problems that persist today: the electronic medical record; standardization of electronic information; using electronic information to catch errors in diagnosis and billing; and the analysis of electronic data.

It seems hard to believe. How could early informaticians have been so advanced with the primitive computers available in the 1960s? If we are still working on the same problems today that we were working on in the 1960s, have we wasted the enormous research funding that was devoted to medical informatics projects through the past five decades? At the very least, these observations signify that present-day medical informaticians are less advanced than we may like to think.

- Jules Berman

key words: progress in medical informatics, hospital computerization, medical advancement, medical progress, hospital information systems, emr, ehr, electronic health record, electronic medical record, medical errors, medical data, hospital data, medical informaticists

Wednesday, April 16, 2008

Perl script for extracting lineages of organisms in EBI Taxonomy

In the past two blogs, I presented Ruby and Python scripts to create phylogenetic lineages for species included in taxonomy.dat. Here is the equivalent project, in Perl.

Taxonomy.dat is a large, publicly available list of organisms. The file is available from the European Bioinformatics Institute (EBI). It contains over 400,000 species:

[A sample record in Taxonomy.dat]
ID : 438
PARENT ID : 434
RANK : species
GC ID : 11
SCIENTIFIC NAME : Acetobacter pasteurianus
SYNONYM : Acetobacter lovaniense
SYNONYM : Acetobacter alcoholophilus
SYNONYM : Acetobacter pasteurianus (Hansen 1879) Beijerinck and Folpmers 1916
SYNONYM : "Ulvina pasteuriana" (Hansen 1879) Pribram 1933
SYNONYM : "Pseudomonas pomi" Cole 1959
SYNONYM : "Mycoderma pasteurianum" Hansen 1879
SYNONYM : Acetobacter pasteurianus ascendens
SYNONYM : Acetobacter pasteurianus paradoxus
SYNONYM : "Acetobacter alcoholophilus" Kozulis and Parsons 1958
SYNONYM : "Acetobacter kutzigianus" (sic) (Hansen 1894) Bergey et al. 1923
SYNONYM : "Acetobacter mobile" (sic) Tosic and Walker 1944
SYNONYM : "Acetobacter vini-aceti" (Henneberg 1906) Shimwell 1948
SYNONYM : "Bacterium vini-aceti" Henneberg 1906
SYNONYM : "Bacterium rancens" Beijerinck 1898
SYNONYM : "Bacillus kuttingianum" (sic) (Hansen 1894) Takahashi 1906
SYNONYM : "Bacillus pasteurianus" (Hansen 1879) Flugge 1886
SYNONYM : "Bacterium pastorianum" (Hansen 1879) Zopf 1883
SYNONYM : "Bacterium kutzingianum" Hansen 1894
SYNONYM : "Bacteriopsis pasteuriana" (Hansen 1879) Trevisan 1885
SYNONYM : Acetobacter agglutinans
SYNONYM : Acetobacter acidum-mucosum
SYNONYM : "Bacillus pasteurianus" (Hansen 1879) Flügge 1886
SYNONYM : "Acetobacter turbidans" Cosbie et al. 1942
SYNONYM : Acetobacter kutzigianus
SYNONYM : Acetobacter mobile
SYNONYM : Acetobacter turbidans
SYNONYM : Acetobacter vini-aceti
SYNONYM : Acetobacter pasteurianus subsp. orleanensis
SYNONYM : Acetobacter pasteurianus orleanensis
SYNONYM : Bacillus kuttingianum
SYNONYM : "Acetobacter acidum-mucosum" (sic) Tosic and Walker 1950
SYNONYM : Bacteriopsis pasteuriana
SYNONYM : Bacterium kutzingianum
SYNONYM : Acetobacter rancens
SYNONYM : "Acetobacter agglutinans" Frateur 1950
SYNONYM : Ulvina pasteuriana
SYNONYM : Pseudomonas pomi
SYNONYM : Mycoderma pasteurianum
SYNONYM : Bacterium vini-aceti
SYNONYM : Bacterium pastorianum
SYNONYM : Bacterium rancens
INCLUDES : Acetobacter turbidans ATCC 9325
INCLUDES : Acetobacter turbidans ATCC9325
IN-PART : Bacillus pasteurianus
//

The taxonomy.dat file exceeds 100 megabytes in length.

The taxonomy.dat file is available for public download through anonymous ftp.

[ftp://ftp.ebi.ac.uk/pub/databases/taxonomy/]

Information about the taxonomy.dat file is found at:

[http://www.ebi.ac.uk/msd-srv/docs/dbdoc/ref_taxonomy.html]

Notice that the sample entry (above) provides an ID number for the entry organism, and for its parent class. Since every organism and class has a parent, you can write a script that reconstructs the full phylogenetic lineage for any entry in taxonomy.dat.

In this blog, I include a Perl script that parses through taxonomy.dat, builds a hash of all the child-parent relationships, then re-parses the file, building the phylogenetic lineage of each organism using the child-parent hash that was built in the first pass.
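
Because the file is large, it helps to read it one record at a time instead of loading it whole. The Python 3 sketch below shows one way to do that, assuming each record ends with a line containing only `//`. The generator name `records` is hypothetical, and the two abbreviated records (taken from samples in this blog) stand in for the real file.

```python
import io

def records(lines):
    """Yield one record (a list of lines) per '//' terminator."""
    current = []
    for line in lines:
        line = line.rstrip("\n")
        if line.strip() == "//":
            if current:
                yield current
            current = []
        elif line.strip():
            current.append(line)
    if current:  # tolerate a final record with no terminator
        yield current

# Two abbreviated records standing in for taxonomy.dat.
SAMPLE = io.StringIO(
    "ID : 438\nPARENT ID : 434\nSCIENTIFIC NAME : Acetobacter pasteurianus\n//\n"
    "ID : 4022\nPARENT ID : 23672\nSCIENTIFIC NAME : Acer\n//\n"
)

for record in records(SAMPLE):
    print(record[0])
```

Passing an open file object in place of `SAMPLE` would work the same way, since both yield lines when iterated.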

This Perl script is provided "as is", without warranty of any kind, express or implied, including but not limited to the warranties of merchantability, fitness for a particular purpose and noninfringement. In no event shall the authors or copyright holders be liable for any claim, damages or other liability, whether in an action of contract, tort or otherwise, arising from, out of or in connection with the software or the use or other dealings in the software.

A copy of the Perl script is available at: http://www.julesberman.info/taxon.htm.

It takes under a minute to execute this script on a desktop computer running at 2.6 GHz with 512 MByte RAM. You may need this much RAM to provide memory for the hash (of child-parent relationships).


#!/usr/local/bin/perl
open(TAXO, "taxonomy.dat");
open(OUT, ">taxo.txt");
$/ = "//";
$line = " ";
while ($line ne "")
  {
  $line = <TAXO>;
  $line =~ /\nID +\: *([0-9]+) *\n/;
  $id_name = $1;
  $line =~ /\nPARENT ID +\: *([0-9]+) *\n/;
  $parent_id_name = $1;
  $parenthash{$id_name} = $parent_id_name;
  $line =~ /\nSCIENTIFIC NAME +\: *([^\n]+) *\n/;
  $scientific_name = $1;
  $namehash{$id_name} = $scientific_name;
  }
close(TAXO);
open(TAXO, "taxonomy.dat");
$line = " ";
while ($line ne "")
  {
  $line = <TAXO>;
  $getline = $line;
  $getline =~ s/\/\///o;
  print OUT $getline . "HIERARCHY\n";
  $line =~ /\nID +\: *([0-9]+) *\n/;
  $id_name = $1;
  for(1..30)
    {
    print OUT "$namehash{$id_name}\n";
    $id_name = $parenthash{$id_name};
    last if ($namehash{$id_name} eq "root");
    }
  print OUT "//";
  }
exit;

The script produces an output file, taxo.txt, that exceeds 224 Megabytes in length. The output consists of the taxonomic entries from taxonomy.dat, along with the phylogenetic lineage for each organism.

A sample ancestral lineage, for "maple trees", is:


Maple trees
ID                        : 4022
PARENT ID                 : 23672
RANK                      : genus
GC ID                     : 1
MGC ID                    : 1
SCIENTIFIC NAME           : Acer
GENBANK COMMON NAME       : maple trees
SYNONYM                   : Acer L.
HIERARCHY
Acer
Sapindaceae
Sapindales
eurosids II
rosids
core eudicotyledons
eudicotyledons
Magnoliophyta
Spermatophyta
Euphyllophyta
Tracheophyta
Embryophyta
Streptophytina
Streptophyta
Viridiplantae
Eukaryota
cellular organisms
//

A web site that produces the phylogeny of any entered species (in taxonomy.dat) is available at: http://www.julesberman.info/post.htm

key words: perl programming language, phylogeny, taxonomy, taxa, taxon, ancestral lineage, classification, phylogenetics, perl script, scripting language, species, phylum, genus

In June, 2014, my book, entitled Rare Diseases and Orphan Drugs: Keys to Understanding and Treating the Common Diseases was published by Elsevier. The book builds the argument that our best chance of curing the common diseases will come from studying and curing the rare diseases.

I urge you to read more about my book. There's a generous preview of the book at the Google Books site. If you like the book, please request your librarian to purchase a copy of this book for your library or reading room.

- Jules J. Berman, Ph.D., M.D.

Tuesday, April 15, 2008

Python script to extract phylogenetic lineages using the EBI taxonomy

In yesterday's blog, I discussed a Ruby script for creating phylogenetic lineages for species included in taxonomy.dat. Here is the equivalent project, in Python.

Taxonomy.dat is a large, publicly available list of organisms. The file is available from the European Bioinformatics Institute (EBI). It contains over 400,000 species:

[A sample record in Taxonomy.dat]

ID : 350094
PARENT ID : 343736
RANK : species
GC ID : 1
MGC ID : 5
SCIENTIFIC NAME : Omalisus fontisbellaquei
MISSPELLING : Omalisus fontisbellaquaei
MISSPELLING : Omalisis fontisbellaguei
//

The taxonomy.dat file exceeds 100 megabytes in length.

The taxonomy.dat file is available for public download through anonymous ftp.

[ftp://ftp.ebi.ac.uk/pub/databases/taxonomy/]

Information about the taxonomy.dat file is found at:

[http://www.ebi.ac.uk/msd-srv/docs/dbdoc/ref_taxonomy.html]

Notice that the sample entry (above) provides an ID number for the entry organism, and for its parent class. Since every organism and class has a parent, you can write a script that reconstructs the full phylogenetic lineage for any entry in taxonomy.dat.

In this blog, I include a Python script that parses through taxonomy.dat, builds a hash of all the child-parent relationships, then re-parses the file, building the phylogenetic lineage of each organism using the child-parent hash that was built in the first pass.
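
The first pass only needs three fields from each record. As a small Python 3 illustration (separate from the script below), the sample record above can be reduced to those fields with one regular expression. The names `FIELD` and `parse_fields` are hypothetical, and the pattern assumes each field sits on its own line in the form `LABEL : value`.

```python
import re

# The sample record shown above, copied verbatim.
RECORD = """ID : 350094
PARENT ID : 343736
RANK : species
GC ID : 1
MGC ID : 5
SCIENTIFIC NAME : Omalisus fontisbellaquei
MISSPELLING : Omalisus fontisbellaquaei
MISSPELLING : Omalisis fontisbellaguei
//
"""

# Anchoring at the start of the line keeps "GC ID" and "MGC ID" from matching "ID".
FIELD = re.compile(r"^(ID|PARENT ID|SCIENTIFIC NAME)\s*:\s*(.+?)\s*$", re.MULTILINE)

def parse_fields(text):
    """Return the three fields needed to build lineages, keyed by label."""
    return dict(FIELD.findall(text))

print(parse_fields(RECORD))
```

The IDs are kept as strings here, which is enough for dictionary lookups and avoids any conversion step.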

This Python script is provided "as is", without warranty of any kind, express or implied, including but not limited to the warranties of merchantability, fitness for a particular purpose and noninfringement. In no event shall the authors or copyright holders be liable for any claim, damages or other liability, whether in an action of contract, tort or otherwise, arising from, out of or in connection with the software or the use or other dealings in the software.

A copy of the Python script is available at: http://www.julesberman.info/taxon.htm.

It takes under a minute to execute this script on a desktop computer running at 2.6 GHz with 512 MByte RAM. You may need this much RAM to provide memory for the hash (of child-parent relationships).


#!/usr/local/bin/python
import re
intext = open("taxonomy.dat", "r")
outtext = open("taxo.txt", "w")
parenthash = {}
namehash = {}
cum_line = ""
childnumber = ""
parentnumber = ""
child_match = re.compile('ID\s+\:\s*(\d+)\s*')
parent_match = re.compile('PARENT ID\s+\:\s*(\d+)\s*')
name_match = re.compile('SCIENTIFIC NAME\s+\:\s*([^\n]+)\s*')
end_match = re.compile('\/\/')
for line in intext:
  p = end_match.search(line)
  if p:
    m = child_match.search(cum_line)
    if m:
      childnumber = m.group(1)
    x = parent_match.search(cum_line)
    if x:
      parentnumber = x.group(1)
    parenthash[childnumber] = parentnumber
    y = name_match.search(cum_line)
    if y:
      scientific_name = y.group(1)
    namehash[childnumber] = scientific_name
    #print childnumber + " " + namehash[childnumber] + " " + parenthash[childnumber]
    cum_line = ""
    continue
  else:
    cum_line = cum_line + line 
cum_line = ""
intext.close()
intext = open("taxonomy.dat", "r")
for line in intext:
  p = end_match.search(line)
  if p:
    print>>outtext, cum_line + "HIERARCHY"
    z = child_match.search(cum_line)
    if z:
      id_name = z.group(1)
    for i in range(30):
      if namehash.has_key(id_name):
        print>>outtext, namehash[id_name]
      if parenthash.has_key(id_name):
        id_name = parenthash[id_name]
    print>>outtext, "//"
    cum_line = ""
    continue
  else:
    cum_line = cum_line + line 
cum_line = ""
exit


ID                        : 9900
PARENT ID                 : 27592
RANK                      : genus
GC ID                     : 1
MGC ID                    : 2
SCIENTIFIC NAME           : Bison
HIERARCHY
Bison
Bovinae
Bovidae
Pecora
Ruminantia
Cetartiodactyla
Laurasiatheria
Eutheria
Theria
Mammalia
Amniota
Tetrapoda
Sarcopterygii
Euteleostomi
Teleostomi
Gnathostomata
Vertebrata
Craniata
Chordata
Deuterostomia
Coelomata
Bilateria
Eumetazoa
Metazoa
Fungi/Metazoa group
Eukaryota
cellular organisms

A web site that produces the phylogeny of any entered species (in taxonomy.dat) is available at: http://www.julesberman.info/post.htm

- Jules Berman

tags: python programming language, phylogeny, taxonomy, taxa, taxon, ancestral lineage, classification, phylogenetics, python script, scripting language, species, phylum, genus

Monday, April 14, 2008

Ruby script for building phylogenetic lineages from EBI's taxonomy.dat

Taxonomy.dat is a large, publicly available list of organisms. The file is available from the European Bioinformatics Institute (EBI). It contains over 400,000 species:

[A sample record in Taxonomy.dat]

[ID : 50]
[PARENT ID : 49]
[RANK: genus]
[GC ID : 11]
[SCIENTIFIC NAME : Chondromyces]
[SYNONYM : Polycephalum]
[SYNONYM : Myxobotrys]
[SYNONYM : Chondromyces Berkeley and Curtis 1874]
[SYNONYM : "Polycephalum" Kalchbrenner and Cooke 1880]
[SYNONYM : "Myxobotrys" Zukal 1896]
[MISSPELLING : Chrondromyces]

The taxonomy.dat file exceeds 100 megabytes in length.

The taxonomy.dat file is available for public download through anonymous ftp.

[ftp://ftp.ebi.ac.uk/pub/databases/taxonomy/]

Information about the taxonomy.dat file is found at:

[http://www.ebi.ac.uk/msd-srv/docs/dbdoc/ref_taxonomy.html]

Notice that the sample entry (above) provides an ID number for the entry organism, Chondromyces, and for its parent class. Since every organism and class has a parent, you can write a script that reconstructs the full phylogenetic lineage for any entry in taxonomy.dat.

In this blog, I include a Ruby script that parses through taxonomy.dat, builds a hash of all the child-parent relationships, then re-parses the file, building the phylogenetic lineage of each organism using the child-parent hash that was built in the first pass.
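
The second pass amounts to following parent links until they run out. Here is a minimal Python 3 sketch of that walk on a toy table. Only the pair 89151 and 4640 comes from the banana record shown at the end of this post; the IDs "P1" and "P2" are made-up placeholders, and the chain is cut short after four levels. The sketch stops when it meets an unknown or repeated ID, which is a different stopping rule from the fixed 30-step loop in the script below.

```python
# Toy tables. "P1" and "P2" are placeholder IDs, not real taxonomy.dat IDs.
PARENT = {"89151": "4640", "4640": "P1", "P1": "P2"}
NAME = {
    "89151": "Musa x paradisiaca",
    "4640": "Musa",
    "P1": "Musaceae",
    "P2": "Zingiberales",
}

def lineage(taxon_id, parent, name):
    """Walk child-to-parent links until the chain ends or repeats."""
    chain, seen = [], set()
    while taxon_id in name and taxon_id not in seen:
        seen.add(taxon_id)
        chain.append(name[taxon_id])
        taxon_id = parent.get(taxon_id)
    return chain

print(lineage("89151", PARENT, NAME))
```

Tracking the IDs already visited means the walk still terminates if an entry happens to list itself as its own parent.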

This Ruby script is provided "as is", without warranty of any kind, express or implied, including but not limited to the warranties of merchantability, fitness for a particular purpose and noninfringement. In no event shall the authors or copyright holders be liable for any claim, damages or other liability, whether in an action of contract, tort or otherwise, arising from, out of or in connection with the software or the use or other dealings in the software.

A copy of the Ruby script is available at: http://www.julesberman.info/taxon.htm.

Lately, I've gotten into the habit of creating Perl and Python versions of my Ruby scripts, and these are also available at the web site. All three scripts operate at about the same speed. It takes under a minute to execute these scripts on a desktop computer running at 2.6 GHz with 512 MByte RAM. You may need this much RAM to provide memory for the hash (of child-parent relationships).


#!/usr/local/bin/ruby
intext = File.open("taxonomy.dat", "r")
outtext = File.open("taxo.txt", "w")
parenthash = Hash.new()
namehash = Hash.new()
intext.each_line("//") do
  |line|
  line =~ /\nID\s+\:\s*([0-9]+)\s*\n/
  child_id = $1
  line =~ /\nPARENT ID\s+\:\s*([0-9]+)\s*\n/
  parent_id = $1
  parenthash[child_id] = parent_id
  line =~ /\nSCIENTIFIC NAME\s+\:\s*([^\n]+)\s*\n/
  scientific_name = $1
  namehash[child_id] = scientific_name
end
intext.close
intext = File.open("taxonomy.dat", "r")
intext.each_line("//") do
  |line|
  getline = line
  getline.sub!(/\/\//,"")
  outtext.puts(getline, "HIERARCHY")
  line =~ /\nID\s+\:\s*([0-9]+)\s*\n/
  id_name = $1
  (1..30).each do
    outtext.puts(namehash[id_name])
    id_name = parenthash[id_name]
    break if namehash[id_name].nil?
  end
  outtext.print("//")
end
exit


ID                        : 89151
PARENT ID                 : 4640
RANK                      : species
GC ID                     : 1
MGC ID                    : 1
SCIENTIFIC NAME           : Musa x paradisiaca
SYNONYM                   : Musa paradisiaca
SYNONYM                   : Musa sapientum
SYNONYM                   : Musa acuminata x Musa balbisiana
SYNONYM                   : Musa x paradisiaca L.
SYNONYM                   : Musa x sapientum L.
SYNONYM                   : Musa x sapientum
COMMON NAME               : banana
INCLUDES                  : Musa sp. RN-2001
MISSPELLING               : Musa lactan
HIERARCHY
Musa x paradisiaca
Musa
Musaceae
Zingiberales
commelinids
Liliopsida
Magnoliophyta
Spermatophyta
Euphyllophyta
Tracheophyta
Embryophyta
Streptophytina
Streptophyta
Viridiplantae
Eukaryota
cellular organisms

A web site that produces the phylogeny of any entered species (in taxonomy.dat) is available at: http://www.julesberman.info/post.htm

- Jules Berman

My book, Principles of Big Data: Preparing, Sharing, and Analyzing Complex Information was published in 2013 by Morgan Kaufmann.

I urge you to explore my book. Google books has prepared a generous preview of the book contents.

tags: big data, metadata, data preparation, data analytics, data repurposing, datamining, data mining, ruby programming language, python programming language, perl programming language, phylogeny, taxonomy, taxa, taxon, ancestral lineage, classification

Saturday, April 12, 2008

Phylogeny extraction from taxonomy.dat, in Ruby, Perl, and Python

I just loaded a new web page that contains equivalent scripts written in Ruby, Perl and Python. Each script will expand the 100+ Megabyte European Bioinformatics Institute's taxonomy.dat file to include the complete phylogenetic lineage for each species included in the file.

Instructions for downloading taxonomy.dat are included at the site. This is an incredible file. There are over 400,000 species listed in taxonomy.dat.

Ruby, Perl and Python scripts:

http://www.julesberman.info/taxon.htm

As discussed in an earlier blog, the site for obtaining the lineage of individually entered species, via a query box, is at:

http://www.julesberman.info/post.htm

- Jules Berman

tags: big data, metadata, data preparation, data analytics, data repurposing, datamining, data mining, classification, organisms, taxa, taxon, taxonomy, nomenclature, ruby programming, perl programming, python programming

Friday, April 11, 2008

A web site for viewing the phylogeny of organisms in the EBI taxonomy

Phylogeny is the ancestral lineage for an organism, beginning at the species level and descending to the primordial organism from which all extant species evolved.

For example, the lineage for Homo sapiens is:

Homo sapiens
Homo
Homo/Pan/Gorilla group
Hominidae
Hominoidea
Catarrhini
Simiiformes
Haplorrhini
Primates
Euarchontoglires
Eutheria
Theria
Mammalia
Amniota
Tetrapoda
Sarcopterygii
Euteleostomi
Teleostomi
Gnathostomata
Vertebrata
Craniata
Chordata
Deuterostomia
Coelomata
Bilateria
Eumetazoa
Metazoa
Fungi/Metazoa group
Eukaryota
cellular organisms

To the best of my knowledge, the most complete source of information on organismal phylogeny is taxonomy.dat, a 100+ Mbyte file available from the European Bioinformatics Institute (EBI).

The taxonomy.dat file lists over 400,000 species, as a taxonomy (i.e., a list with a parent class assigned to all species). Taxonomy.dat provides a species id number and an id number for the parent class for each taxonomic entry.

Using this information, it is possible to compute the complete classification hierarchy for each of the 400,000-plus named organisms in taxonomy.dat.
Though there are millions of species of organisms on earth, the taxonomy.dat file is one of the most comprehensive taxonomic references available to scientists.

I have created a web site that allows users to enter the name (scientific name or common name) of any species listed in the EBI's taxonomy file and get back a full listing of the descending phylogenetic class hierarchy for the species.

It is available at:

http://www.julesberman.info/post.htm

After entering the full name of a species into the input box on that page, and pressing the submit button, you get the taxonomic entry for the organism, followed by its phylogenetic hierarchy.

If you can't think of an organism to enter in the query box, here are some suggestions:

Mycobacterium
Maple trees
Pea
Banana
Helicobacter pylori
Coccidioides immitis
Kangaroo
Toxoplasma gondii
Chicken
Actinomyces

- Jules Berman

key words: tree of life, classification, class hierarchy, phyla, genus, species, kingdom, class, order, phylum, living organisms, taxonomy, ontology, living organisms, animals, botany, zoology, mycology, eukaryotes, bacteria

Friday, April 4, 2008

Generating prime numbers in Ruby, Python and Perl

If you need to generate prime numbers, the classic algorithm is the Sieve of Eratosthenes. Both the Sieve and a simpler, but slower method (the one shown here) are discussed in Mastering Algorithms with Perl, by Orwant, Hietaniemi, and Macdonald (O'Reilly, 1999).
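
For comparison, the Sieve itself can be sketched in a few lines of Python 3. This is a minimal illustration, not the version from the book, and the function name `sieve` is an arbitrary choice.

```python
import math

def sieve(limit):
    """Return all primes below `limit` using the Sieve of Eratosthenes."""
    if limit < 2:
        return []
    is_prime = [True] * limit
    is_prime[0] = is_prime[1] = False
    for n in range(2, math.isqrt(limit) + 1):
        if is_prime[n]:
            # Every multiple of n from n*n upward is composite.
            for multiple in range(n * n, limit, n):
                is_prime[multiple] = False
    return [n for n, flag in enumerate(is_prime) if flag]

print(sieve(30))  # [2, 3, 5, 7, 11, 13, 17, 19, 23, 29]
```

The Sieve trades memory for speed: it keeps one flag per candidate number, but it never performs a division.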

Here are the equivalent Ruby, Python and Perl scripts.

As with all of my scripts, the following disclaimer applies. Each of these three scripts is provided by its creator, Jules J. Berman, "as is", without warranty of any kind, expressed or implied, including but not limited to the warranties of merchantability, fitness for a particular purpose and noninfringement. In no event shall the author or copyright holder be liable for any claim, damages or other liability, whether in an action of contract, tort or otherwise, arising from, out of or in connection with the software or the use or other dealings in the software.

For information about using Ruby for biomedical projects, you may want to read my recently published book:

Ruby Programming for Medicine and Biology

For information about using Perl for biomedical projects, you may want to read my recently published book:

Perl Programming for Medicine and Biology

Here is the Ruby script:


#!/usr/local/bin/ruby
state = Numeric.new
print "2,3,"
(4..10000).each do
   |i|
   (2..(Math.sqrt(i).ceil)).each do
      |thing|
      state = 1
      if (i.divmod(thing)[1] == 0)
         state = 0
         break
      end
   end
   print "#{i}\," unless (state == 0)
end 
exit

Here is the equivalent Python script:


#!/usr/local/bin/python
import math
print "2,3,"
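for i in range(4, 10000):
   upper = int(math.sqrt(i))
   state = 1   # assume i is prime until a divisor is found
   # range() stops one short of its second argument, so add 1
   # to include the square root itself among the divisors tested
   for thing in range(2, upper + 1):
      if (i % thing == 0):
         state = 0
         break
   if (state == 1):
      print i,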
exit

Here's the equivalent Perl script.


#!/usr/local/bin/perl
print "2,3,";
for($i=4;$i<10000;$i++)
   {
   for $thing (2 .. int(sqrt($i)))
      {
      $state = 1;
      if ($i % $thing == 0)
         {
         $state = 0;
         last;
         }
      }
   print "$i\," unless ($state == 0);
   }
exit;

Sample output of the scripts is available.

The scripts are pretty straightforward. A prime number cannot be written as the product of two smaller integers that are both greater than 1. We test each number up to 10,000 to determine if it is a prime number. If the number is prime, then no smaller integer greater than 1 will divide into it with a remainder of 0. So we try the candidate divisors, starting with 2, to see if any of them divides into the number with a zero remainder. If none does, then the number is a prime.

We stop when we get up to the square root of the number. If there were an integer larger than the square root of the number, which could be multiplied by another integer to give the number, then the other integer would need to be smaller than the square root of the number (otherwise, the two integers would produce a product larger than the number). But we've already tested all of the integers up to and including the square root of the number, and they all yielded a non-zero remainder. So we don't need to test the integers greater than the square root of the number.
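
The square-root argument is easy to see by listing factor pairs. In this short Python 3 sketch (the function name `factor_pairs` is illustrative), every pair is found by trying only the integers up to the square root, because the smaller member of a pair can never exceed it.

```python
import math

def factor_pairs(n):
    """List every pair (small, large) with small * large == n and small <= large."""
    return [(d, n // d) for d in range(1, math.isqrt(n) + 1) if n % d == 0]

print(factor_pairs(36))  # [(1, 36), (2, 18), (3, 12), (4, 9), (6, 6)]
print(factor_pairs(37))  # [(1, 37)]
```

A number greater than 1 whose only pair is (1, n), like 37, is prime.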

- Jules J. Berman, Ph.D., M.D.

tags: prime numbers, prime number generation, calculating prime numbers, ruby programming, perl programming, python programming