Wednesday, April 16, 2008

Perl script for extracting lineages of organisms in EBI Taxonomy

In the last two blog posts, I presented Ruby and Python scripts to create phylogenetic lineages for species included in taxonomy.dat. Here is the equivalent project, in Perl.

Taxonomy.dat is a large, publicly available list of organisms. The file is available from the European Bioinformatics Institute (EBI). It contains over 400,000 species:

[A sample record in Taxonomy.dat]
ID : 438
RANK : species
GC ID : 11
SCIENTIFIC NAME : Acetobacter pasteurianus
SYNONYM : Acetobacter lovaniense
SYNONYM : Acetobacter alcoholophilus
SYNONYM : Acetobacter pasteurianus (Hansen 1879) Beijerinck and Folpmers 1916
SYNONYM : "Ulvina pasteuriana" (Hansen 1879) Pribram 1933
SYNONYM : "Pseudomonas pomi" Cole 1959
SYNONYM : "Mycoderma pasteurianum" Hansen 1879
SYNONYM : Acetobacter pasteurianus ascendens
SYNONYM : Acetobacter pasteurianus paradoxus
SYNONYM : "Acetobacter alcoholophilus" Kozulis and Parsons 1958
SYNONYM : "Acetobacter kutzigianus" (sic) (Hansen 1894) Bergey et al. 1923
SYNONYM : "Acetobacter mobile" (sic) Tosic and Walker 1944
SYNONYM : "Acetobacter vini-aceti" (Henneberg 1906) Shimwell 1948
SYNONYM : "Bacterium vini-aceti" Henneberg 1906
SYNONYM : "Bacterium rancens" Beijerinck 1898
SYNONYM : "Bacillus kuttingianum" (sic) (Hansen 1894) Takahashi 1906
SYNONYM : "Bacillus pasteurianus" (Hansen 1879) Flugge 1886
SYNONYM : "Bacterium pastorianum" (Hansen 1879) Zopf 1883
SYNONYM : "Bacterium kutzingianum" Hansen 1894
SYNONYM : "Bacteriopsis pasteuriana" (Hansen 1879) Trevisan 1885
SYNONYM : Acetobacter agglutinans
SYNONYM : Acetobacter acidum-mucosum
SYNONYM : "Bacillus pasteurianus" (Hansen 1879) Flügge 1886
SYNONYM : "Acetobacter turbidans" Cosbie et al. 1942
SYNONYM : Acetobacter kutzigianus
SYNONYM : Acetobacter mobile
SYNONYM : Acetobacter turbidans
SYNONYM : Acetobacter vini-aceti
SYNONYM : Acetobacter pasteurianus subsp. orleanensis
SYNONYM : Acetobacter pasteurianus orleanensis
SYNONYM : Bacillus kuttingianum
SYNONYM : "Acetobacter acidum-mucosum" (sic) Tosic and Walker 1950
SYNONYM : Bacteriopsis pasteuriana
SYNONYM : Bacterium kutzingianum
SYNONYM : Acetobacter rancens
SYNONYM : "Acetobacter agglutinans" Frateur 1950
SYNONYM : Ulvina pasteuriana
SYNONYM : Pseudomonas pomi
SYNONYM : Mycoderma pasteurianum
SYNONYM : Bacterium vini-aceti
SYNONYM : Bacterium pastorianum
SYNONYM : Bacterium rancens
INCLUDES : Acetobacter turbidans ATCC 9325
INCLUDES : Acetobacter turbidans ATCC9325
IN-PART : Bacillus pasteurianus
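
As a minimal sketch of how a record like this can be read into Perl, the snippet below splits each "FIELD : value" line into a hash whose values are lists (fields such as SYNONYM repeat). The subroutine name parse_record and the abridged record are my own illustrations, and the sketch assumes every field name is written in capitals followed by a colon.

```perl
use strict;
use warnings;

# Parse one taxonomy.dat-style record into a hash of field name => list of values.
# Assumes each line looks like "FIELD NAME : value"; other lines are ignored.
sub parse_record {
    my ($record) = @_;
    my %fields;
    for my $line (split /\n/, $record) {
        next unless $line =~ /^\s*([A-Z][A-Z -]*?)\s*:\s*(.*?)\s*$/;
        push @{ $fields{$1} }, $2;
    }
    return \%fields;
}

# Abridged copy of the sample record shown above.
my $sample = <<'END';
ID : 438
RANK : species
GC ID : 11
SCIENTIFIC NAME : Acetobacter pasteurianus
SYNONYM : Acetobacter lovaniense
SYNONYM : Acetobacter alcoholophilus
IN-PART : Bacillus pasteurianus
END

my $rec = parse_record($sample);
print "ID: $rec->{ID}[0]\n";
print "Name: $rec->{'SCIENTIFIC NAME'}[0]\n";
print "Synonyms: ", scalar @{ $rec->{SYNONYM} }, "\n";
```

Storing every field as a list keeps the repeated SYNONYM and INCLUDES lines without special cases.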

The taxonomy.dat file exceeds 100 megabytes in length.

The taxonomy.dat file is available for public download through anonymous ftp.


Information about the taxonomy.dat file is found at:

Every entry in taxonomy.dat provides an ID number for the entry organism (the ID field) and for its parent class (the PARENT ID field). Since every organism and class has a parent, you can write a script that reconstructs the full phylogenetic lineage for any entry in taxonomy.dat.
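
To make the idea concrete, here is a small sketch of the lineage walk on its own. The IDs and names in the two hashes are invented for illustration (they are not real taxonomy.dat entries), and the subroutine name lineage is hypothetical; the only assumption carried over from the real file is that the top of the tree is named "root".

```perl
use strict;
use warnings;

# Toy child => parent and ID => name tables. The IDs and names are invented
# for illustration; they are not real taxonomy.dat entries.
my %parent_of = (30 => 20, 20 => 10, 10 => 1, 1 => 0);
my %name_of   = (30 => 'toy species', 20 => 'toy genus',
                 10 => 'toy kingdom', 1  => 'root');

# Return the names from the given ID up to (but not including) "root".
# The %seen hash guards against a cycle in the parent links.
sub lineage {
    my ($id) = @_;
    my (@names, %seen);
    while (defined $id && exists $name_of{$id} && !$seen{$id}++) {
        last if $name_of{$id} eq 'root';
        push @names, $name_of{$id};
        $id = $parent_of{$id};
    }
    return @names;
}

print join(' > ', lineage(30)), "\n";   # toy species > toy genus > toy kingdom
```

An unknown ID simply yields an empty list, so the walk never loops forever on incomplete data.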

In this blog, I include a Perl script that parses through taxonomy.dat, builds a hash of all the child-parent relationships, then re-parses the file, building the phylogenetic lineage of each organism using the child-parent hash that was built in the first pass.

This Perl script is provided "as is", without warranty of any kind, express or implied, including but not limited to the warranties of merchantability, fitness for a particular purpose and noninfringement. In no event shall the authors or copyright holders be liable for any claim, damages or other liability, whether in an action of contract, tort or otherwise, arising from, out of or in connection with the software or the use or other dealings in the software.

A copy of the Perl script is available at:

It takes under a minute to execute this script on a desktop computer running at 2.6 GHz with 512 MB of RAM. You may need this much RAM to hold the hash of child-parent relationships.

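#!/usr/bin/perl
use strict;
use warnings;

$/ = "//";    # records in taxonomy.dat are separated by "//"
my (%parenthash, %namehash);

# First pass: map each ID to its parent ID and to its scientific name.
open(my $taxo, "<", "taxonomy.dat") or die "Cannot open taxonomy.dat: $!";
while (my $line = <$taxo>) {
    next unless $line =~ /^ID\s*:\s*(\d+)\s*$/m;
    my $id_name = $1;
    $parenthash{$id_name} = $1 if $line =~ /^PARENT ID\s*:\s*(\d+)\s*$/m;
    $namehash{$id_name}   = $1 if $line =~ /^SCIENTIFIC NAME\s*:\s*(.+?)\s*$/m;
}
close($taxo);

# Second pass: copy each record to the output, then append its lineage.
open($taxo, "<", "taxonomy.dat") or die "Cannot open taxonomy.dat: $!";
open(my $out, ">", "taxo.txt") or die "Cannot open taxo.txt: $!";
while (my $line = <$taxo>) {
    next unless $line =~ /^ID\s*:\s*(\d+)\s*$/m;
    my $id_name = $1;
    (my $getline = $line) =~ s{//}{};    # strip the record terminator
    print $out $getline . "HIERARCHY\n";
    # Walk up the child-parent hash, stopping at "root", at an unknown ID,
    # or at an ID already visited (a guard against circular parent links).
    my %seen;
    while (defined $id_name && defined $namehash{$id_name}
           && $namehash{$id_name} ne "root" && !$seen{$id_name}++) {
        print $out "$namehash{$id_name}\n";
        $id_name = $parenthash{$id_name};
    }
    print $out "//";
}
close($taxo);
close($out);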

The script produces an output file, taxo.txt, that exceeds 224 megabytes in length. The output consists of the taxonomic entries from taxonomy.dat, each followed by the phylogenetic lineage of the organism.
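
If you want to pull the lineage back out of an output record, something along these lines should work. This is only a sketch: it assumes the layout the script writes (the fields, a "HIERARCHY" line, one name per line, then "//"), the subroutine name split_record is hypothetical, and the record in the example is invented rather than copied from taxo.txt.

```perl
use strict;
use warnings;

# Split one taxo.txt-style record into its field text and its lineage list.
# Assumes the layout: fields, a "HIERARCHY" line, one name per line, "//".
sub split_record {
    my ($record) = @_;
    $record =~ s{//\s*\z}{};                      # drop the record terminator
    my ($fields, $tail) = split /^HIERARCHY\n/m, $record, 2;
    my @lineage = grep { length } split /\n/, ($tail // '');
    return ($fields, \@lineage);
}

# Invented record for illustration only.
my $record = "\nID : 30\nPARENT ID : 20\nSCIENTIFIC NAME : toy species\n"
           . "HIERARCHY\ntoy species\ntoy genus\ntoy kingdom\n//";

my ($fields, $lineage) = split_record($record);
print "Lineage: ", join(' > ', @$lineage), "\n";
```

To process the whole file, set $/ to "//" as in the main script and pass each record to split_record in turn.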

A sample ancestral lineage, for "maple trees", is:

Maple trees
ID : 4022
PARENT ID : 23672
RANK : genus
GC ID : 1
MGC ID : 1
eurosids II
core eudicotyledons
cellular organisms

A web site that produces the phylogeny of any entered species (in taxonomy.dat) is available at:

key words: perl programming language, phylogeny, taxonomy, taxa, taxon, ancestral lineage, classification, phylogenetics, perl script, scripting language, species, phylum, genus

- Jules J. Berman, Ph.D., M.D.