Monday, April 14, 2008

Ruby script for building phylogenetic lineages from EBI's taxonomy.dat

Taxonomy.dat is a large, publicly available list of organisms. The file is available from the European Bioinformatics Institute (EBI). It contains over 400,000 species:

[A sample record in Taxonomy.dat]

[ID : 50]
[PARENT ID : 49]
[RANK: genus]
[GC ID : 11]
[SCIENTIFIC NAME : Chondromyces]
[SYNONYM : Polycephalum]
[SYNONYM : Myxobotrys]
[SYNONYM : Chondromyces Berkeley and Curtis 1874]
[SYNONYM : "Polycephalum" Kalchbrenner and Cooke 1880]
[SYNONYM : "Myxobotrys" Zukal 1896]
[MISSPELLING : Chrondromyces]
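
A record like this maps naturally onto a hash of field names to values. The following is only a minimal sketch, not part of the script in this post: parse_record is a hypothetical helper name, and the sample text is a shortened copy of the Chondromyces record above. It assumes each field sits on its own line in "FIELD : value" form, and that repeatable fields such as SYNONYM should be collected into arrays.

```ruby
# Hypothetical helper (not from the original script): turn the text of one
# taxonomy.dat record into a Hash of field name => array of values.
def parse_record(text)
  fields = Hash.new { |hash, key| hash[key] = [] }
  text.each_line do |line|
    # Split "FIELD NAME : value" at the first colon only.
    key, value = line.split(":", 2)
    next if value.nil? # skip lines without a colon, such as the "//" terminator
    fields[key.strip] << value.strip
  end
  fields
end

# Shortened copy of the sample record shown above.
sample = <<~RECORD
  ID                        : 50
  PARENT ID                 : 49
  RANK                      : genus
  GC ID                     : 11
  SCIENTIFIC NAME           : Chondromyces
  SYNONYM                   : Polycephalum
  SYNONYM                   : Myxobotrys
  MISSPELLING               : Chrondromyces
  //
RECORD

record = parse_record(sample)
puts record["SCIENTIFIC NAME"].first  # prints "Chondromyces"
puts record["SYNONYM"].inspect        # prints ["Polycephalum", "Myxobotrys"]
```

Keeping every value in an array is a design choice for uniformity: single-valued fields such as ID are read with .first, and multi-valued fields need no special handling.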

The taxonomy.dat file exceeds 100 megabytes in length.

The taxonomy.dat file is available for public download through anonymous ftp.


Information about the taxonomy.dat file is found at:


Notice that the sample entry (above) provides an ID number for the entry organism, Chondromyces, and for its parent class. Since every organism and class has a parent, you can write a script that reconstructs the full phylogenetic lineage for any entry in taxonomy.dat.
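
The idea of climbing from child to parent can be sketched on a toy table. Everything in this example is an assumption made for illustration: the IDs and names are invented (they are not taken from taxonomy.dat), and lineage is a hypothetical method name. The walk stops when an ID has no known name or has already been visited, so a self-referencing root cannot cause an endless loop.

```ruby
# Toy tables; these IDs and names are invented for illustration only.
PARENT = { "4" => "3", "3" => "2", "2" => "1" }
NAME = {
  "1" => "root",
  "2" => "toy kingdom",
  "3" => "toy genus",
  "4" => "toy species"
}

# Hypothetical helper: follow child => parent links, collecting names.
def lineage(id, parent_of, name_of)
  names = []
  seen = {}
  # Stop at an unknown ID, or at an ID we have already visited (a cycle).
  while id && name_of.key?(id) && !seen[id]
    seen[id] = true
    names << name_of[id]
    id = parent_of[id]
  end
  names
end

puts lineage("4", PARENT, NAME).join(" > ")
# prints "toy species > toy genus > toy kingdom > root"
```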

In this post, I include a Ruby script that parses taxonomy.dat, builds a hash of all the child-parent relationships, then re-parses the file, building the phylogenetic lineage of each organism from the child-parent hash that was built in the first pass.

This Ruby script is provided "as is", without warranty of any kind, express or implied, including but not limited to the warranties of merchantability, fitness for a particular purpose and noninfringement. In no event shall the authors or copyright holders be liable for any claim, damages or other liability, whether in an action of contract, tort or otherwise, arising from, out of or in connection with the software or the use or other dealings in the software.

A copy of the Ruby script is available at:

Lately, I've gotten into the habit of creating Perl and Python versions of my Ruby scripts, and these are also available at the web site. All three scripts operate at about the same speed. It takes under a minute to execute these scripts on a desktop computer running at 2.6 GHz with 512 MB of RAM. You may need this much RAM to hold the two hashes (the child-parent relationships and the scientific names).

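intext = File.open("taxonomy.dat", "r")
outtext = File.open("taxo.txt", "w")
parenthash = Hash.new
namehash = Hash.new
# First pass: records end with "//"; store each entry's parent ID
# and scientific name, keyed by the entry's own ID.
intext.each_line("//") do |line|
  next unless line =~ /^ID\s*:\s*([0-9]+)/
  child_id = $1
  parenthash[child_id] = $1 if line =~ /^PARENT ID\s*:\s*([0-9]+)/
  namehash[child_id] = $1.strip if line =~ /^SCIENTIFIC NAME\s*:\s*([^\n]+)/
end
intext.close
# Second pass: copy each entry to the output file, followed by its lineage.
intext = File.open("taxonomy.dat", "r")
intext.each_line("//") do |line|
  next unless line =~ /^ID\s*:\s*([0-9]+)/
  id_name = $1
  outtext.puts(line, "HIERARCHY")
  outtext.puts(namehash[id_name])
  # Climb from child to parent, for at most 30 levels,
  # stopping as soon as a parent has no known name.
  (1..30).each do
    id_name = parenthash[id_name]
    break if namehash[id_name].nil?
    outtext.puts(namehash[id_name])
  end
end
intext.close
outtext.close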

The script produces an output file, taxo.txt, that exceeds 224 megabytes in length. The output consists of the taxonomic entries from taxonomy.dat, along with the phylogenetic lineage for each organism.

A sample ancestral lineage, for "banana", is:

ID : 89151
PARENT ID : 4640
RANK : species
GC ID : 1
MGC ID : 1
SCIENTIFIC NAME : Musa x paradisiaca
SYNONYM : Musa paradisiaca
SYNONYM : Musa sapientum
SYNONYM : Musa acuminata x Musa balbisiana
SYNONYM : Musa x paradisiaca L.
SYNONYM : Musa x sapientum L.
SYNONYM : Musa x sapientum
COMMON NAME : banana
INCLUDES : Musa sp. RN-2001
MISSPELLING : Musa lactan
Musa x paradisiaca
cellular organisms
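
To pull the lineage back out of an entry in taxo.txt, the entry text can be split at the "HIERARCHY" marker line that the script writes between the record and its lineage. This is a rough sketch under stated assumptions: split_entry is a hypothetical helper name, it works on the text of a single entry that has already been read into a string, and the abbreviated banana entry below is typed in by hand with the "//" terminator and "HIERARCHY" line added (neither is shown in the sample above).

```ruby
# Hypothetical helper: split one taxo.txt entry into its record text and
# its list of lineage names, using the "HIERARCHY" marker line.
def split_entry(entry)
  record, marker, names = entry.partition(/^HIERARCHY$/)
  # No marker found: treat the whole text as the record, with no lineage.
  return [record.strip, []] if marker.empty?
  [record.strip, names.lines.map(&:strip).reject(&:empty?)]
end

# Abbreviated, hand-typed entry; the layout is an assumption (see above).
entry = <<~ENTRY
  ID : 89151
  PARENT ID : 4640
  SCIENTIFIC NAME : Musa x paradisiaca
  //
  HIERARCHY
  Musa x paradisiaca
  cellular organisms
ENTRY

record, names = split_entry(entry)
puts names.first  # prints "Musa x paradisiaca"
puts names.last   # prints "cellular organisms"
```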

A web site that produces the phylogeny of any entered species (in taxonomy.dat) is available at:

- Jules Berman