Tuesday, April 15, 2008

Python script to extract phylogenetic lineages using the EBI taxonomy

In yesterday's blog, I discussed a Ruby script for creating phylogenetic lineages for species included in taxonomy.dat. Here is the equivalent project, in Python.

Taxonomy.dat is a large, publicly available list of organisms. The file is available from the European Bioinformatics Institute (EBI). It contains over 400,000 species:

[A sample record in Taxonomy.dat]

ID : 350094
PARENT ID : 343736
RANK : species
GC ID : 1
MGC ID : 5
SCIENTIFIC NAME : Omalisus fontisbellaquei
MISSPELLING : Omalisus fontisbellaquaei
MISSPELLING : Omalisis fontisbellaguei
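
As a rough illustration of how the fields in the sample record above could be read into a Python dictionary, here is a minimal sketch. The function name parse_record is my own invention for this example, and the sketch assumes that every field line has the form "NAME : value" and that a record ends with a "//" line. Fields that repeat, such as MISSPELLING, are collected into lists.

```python
def parse_record(text):
    """Split one record into a dict mapping field name -> list of values.

    parse_record is a hypothetical helper, not part of the script in this post.
    """
    fields = {}
    for line in text.splitlines():
        if ":" not in line:
            continue  # skip blank lines and the "//" terminator
        key, _, value = line.partition(":")
        fields.setdefault(key.strip(), []).append(value.strip())
    return fields


# The sample record shown above, followed by an assumed "//" terminator.
sample = """ID : 350094
PARENT ID : 343736
RANK : species
GC ID : 1
MGC ID : 5
SCIENTIFIC NAME : Omalisus fontisbellaquei
MISSPELLING : Omalisus fontisbellaquaei
MISSPELLING : Omalisis fontisbellaguei
//
"""

record = parse_record(sample)
print(record["ID"], record["PARENT ID"], record["MISSPELLING"])
```

Keeping every value in a list costs a little convenience for single-valued fields, but it avoids silently dropping the second MISSPELLING line.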

The taxonomy.dat file exceeds 100 megabytes in length.

The taxonomy.dat file is available for public download through anonymous ftp.


Information about the taxonomy.dat file is found at:


Notice that the sample entry (above) provides an ID number for the entry organism and for its parent class. Since every organism and class has a parent, you can write a script that reconstructs the full phylogenetic lineage for any entry in taxonomy.dat.
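
To make the idea concrete, here is a minimal sketch of walking a child-to-parent table upward from one entry. The function name lineage and the max_depth guard are assumptions made for this example. In the toy table, only the first pair (350094 and its parent 343736) comes from the sample record; the other IDs are invented placeholders.

```python
def lineage(taxon_id, parent_of, max_depth=100):
    """Return the chain of IDs from taxon_id up to the top of the tree.

    lineage is a hypothetical helper; max_depth only guards against cycles.
    """
    chain = [taxon_id]
    while taxon_id in parent_of and len(chain) < max_depth:
        parent = parent_of[taxon_id]
        if parent == taxon_id:
            break  # an entry listed as its own parent marks the top
        chain.append(parent)
        taxon_id = parent
    return chain


# Toy child -> parent table. "1000" and "1" are made-up IDs for illustration.
parent_of = {"350094": "343736", "343736": "1000", "1000": "1"}

print(lineage("350094", parent_of))
# prints ['350094', '343736', '1000', '1']
```

The walk stops as soon as an ID has no recorded parent, so no knowledge of how the top of the real tree is encoded is needed.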

In this blog, I include a Python script that parses taxonomy.dat, builds a hash (a Python dictionary) of all the child-parent relationships, then re-parses the file, building the phylogenetic lineage of each organism from the child-parent hash that was built in the first pass.
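
Both passes read the file one record at a time. One possible way to express that step, sketched here under the assumption that each record ends with a line beginning with "//", is a small generator. The name records and the two-record demo text are my own; the demo uses an in-memory file so it runs without taxonomy.dat.

```python
import io


def records(handle):
    """Yield each record as one string, without its "//" terminator line.

    records is a hypothetical helper, not part of the script in this post.
    """
    lines = []
    for line in handle:
        if line.startswith("//"):
            if lines:
                yield "".join(lines)
            lines = []
        else:
            lines.append(line)
    if lines:
        yield "".join(lines)  # final record if the closing "//" is missing


# Made-up two-record input standing in for taxonomy.dat.
demo = io.StringIO("ID : 10\nPARENT ID : 1\n//\nID : 20\nPARENT ID : 10\n//\n")
chunks = list(records(demo))
print(len(chunks))
```

Because the generator holds only one record in memory at a time, the same function could serve the first pass (filling the hash) and the second pass (writing lineages) by opening the file twice.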

This Python script is provided "as is", without warranty of any kind, express or implied, including but not limited to the warranties of merchantability, fitness for a particular purpose and noninfringement. In no event shall the authors or copyright holders be liable for any claim, damages or other liability, whether in an action of contract, tort or otherwise, arising from, out of or in connection with the software or the use or other dealings in the software.

A copy of the Python script is available at: http://www.julesberman.info/taxon.htm.

It takes under a minute to execute this script on a desktop computer running at 2.6 GHz with 512 MByte RAM. You may need this much RAM to hold the hash of child-parent relationships in memory.

import re

# Patterns for the fields used from each record. Each pattern is anchored to
# the start of a line, so the ID pattern cannot match "PARENT ID" or "GC ID".
child_match = re.compile(r'^ID\s+:\s*(\d+)', re.MULTILINE)
parent_match = re.compile(r'^PARENT ID\s+:\s*(\d+)', re.MULTILINE)
name_match = re.compile(r'^SCIENTIFIC NAME\s+:\s*([^\n]+)', re.MULTILINE)
end_match = re.compile(r'^//')

parenthash = {}  # child ID -> parent ID
namehash = {}    # ID -> scientific name

# First pass: collect the child-parent and ID-name relationships.
cum_line = ""
intext = open("taxonomy.dat", "r")
for line in intext:
    if end_match.search(line):
        # "//" closes a record; cum_line now holds the whole record.
        m = child_match.search(cum_line)
        if m:
            childnumber = m.group(1)
            x = parent_match.search(cum_line)
            if x:
                parenthash[childnumber] = x.group(1)
            y = name_match.search(cum_line)
            if y:
                namehash[childnumber] = y.group(1).strip()
        cum_line = ""
    else:
        cum_line = cum_line + line
intext.close()

# Second pass: copy each record to the output, followed by its lineage.
cum_line = ""
intext = open("taxonomy.dat", "r")
outtext = open("taxo.txt", "w")
for line in intext:
    if end_match.search(line):
        outtext.write(cum_line + "HIERARCHY\n")
        z = child_match.search(cum_line)
        if z:
            id_name = z.group(1)
            # Walk up the tree, printing each ancestor (at most 30 levels).
            for i in range(30):
                if id_name in namehash:
                    outtext.write(namehash[id_name] + "\n")
                parent = parenthash.get(id_name)
                if parent is None or parent == id_name:
                    break  # reached the top of the tree
                id_name = parent
        outtext.write("//\n")
        cum_line = ""
    else:
        cum_line = cum_line + line
intext.close()
outtext.close()

The script produces an output file, taxo.txt, that exceeds 224 megabytes in length. The output consists of the taxonomic entries from taxonomy.dat, along with the phylogenetic lineage for each organism.

A sample ancestral lineage, for "bison", is:

ID : 9900
PARENT ID : 27592
RANK : genus
GC ID : 1
MGC ID : 2
Fungi/Metazoa group
cellular organisms

A web site that produces the phylogeny of any entered species (in taxonomy.dat) is available at: http://www.julesberman.info/post.htm

- Jules Berman

tags: python programming language, phylogeny, taxonomy, taxa, taxon, ancestral lineage, classification, phylogenetics, python script, scripting language, species, phylum, genus