Sunday, July 1, 2007

Neoplasm classification structural validation

In yesterday's post, I announced the newest version of the Developmental Lineage Classification and Taxonomy of Neoplasms (also called the Neoplasm Classification).

When you have a nomenclature that contains hundreds of thousands of terms, and when new versions of the nomenclature are regularly released, you need computational methods to check the internal consistency of the nomenclature. The classification is in XML, and this makes it easy to write a multi-purpose parsing script.

The Perl script (below) has three purposes:

1. It checks that neocl.xml is well-formed XML.

2. It checks that a concept-identifying code in one class is not repeated in any other class within neocl.xml.

3. It checks that no term in neocl.xml is repeated.

On my 2.5 GHz computer, the xmlvocab.pl Perl script takes about 4 seconds to parse and check the 10+ megabyte neocl.xml file. The script prints a message for each problem code or term that it finds in the nomenclature.

#!/usr/bin/perl
#xmlvocab.pl
#
#This Perl script was created by Jules J. Berman and
#updated on 5/19/2005
#
#Copyright (c) 2005 Jules J. Berman
#
#Permission is granted to copy, distribute and/or
#modify this document
#under the terms of the GNU Free Documentation
#License, Version 1.2
#or any later version published by the Free
#Software Foundation;
#with no Invariant Sections, no Front-Cover Texts,
#and no Back-Cover Texts.
#
#The software is provided "as is", without warranty
#of any kind, express or implied, including but not
#limited to the warranties of merchantability,
#fitness for a particular purpose and
#noninfringement. In no event shall the authors
#or copyright holders be liable for any claim, damages
#or other liability, whether in an action of contract,
#tort or otherwise, arising from, out of or in connection
#with the software or the use or other dealings in the
#software.
#
#An explanation of the classification can be found in
#the following two publications, which should be cited
#in any publication or work that may result from any
#use of this file.
#
#Berman JJ. Tumor classification: molecular analysis
#meets Aristotle. BMC Cancer 4:8, 2004.
#
#neocl.xml is the classification of all neoplastic lesions.
#
use XML::Parser;
my $parser = XML::Parser->new( Handlers => {
Init => \&handle_doc_start,
Final => \&handle_doc_end,
});
$file = "neocl.xml";
$parser -> parsefile($file);

sub handle_doc_start
{
print "\nBeginning to parse $file now\n";
}

sub handle_doc_end
{
print "\nFinished. $file is a well-formed XML File.\n";
}

open (TEXT, $file) or die "Cannot open $file: $!\n";
#open (OUT,">neocl.out");
my $countcode = 0;
my $line = " ";
my %code;
my $classname;
my $phrasecount = 0;
while ($line ne "")
{
$line = <TEXT>;
next unless ($line =~ /\<.+\>/);
#A line that begins with a lowercase tag starts a new class
if ($line =~ /^\<([a-z\_]+)\>/)
{
$classname = $1;
next;
}
if ($line =~ /[CS]([0-9]{7})/)
{
$phrasecount++;
if (exists $code{$&})
{
if ($code{$&} ne $classname)
{
print "$& is a problem: found in both $code{$&} and $classname\n";
}
}
else
{
$code{$&} = $classname;
$countcode++;
}
}
}
close TEXT;
print "The total number of concepts is $countcode\n";
print "The total number of phrases is $phrasecount\n";
open (TEXT, $file) or die "Cannot open $file: $!\n";
undef %code;
$line = " ";
my %item;
while ($line ne "")
{
$line = <TEXT>;
if ($line =~ /([CS])([0-9]{7})/)
{
$prefix = $1;
#Skip lines with no term text, so a stale $1 is never reused
next unless ($line =~ /\"\> ?(.+) ?\<\//);
$phrase = $prefix . $1;
if (exists $item{$phrase})
{
print $. . " More than one occurrence of \"$phrase\"\n";
}
$item{$phrase}="";
}
}
close TEXT;
exit;
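
The two duplicate checks in the script above can be tried without the full neocl.xml file. What follows is only a minimal sketch, not the script itself. The sample lines, the class names (endoderm, mesoderm), the codes (C0000001, C0000002) and the helper name check_lines are all invented for illustration, and the line layout is an assumption inferred from the regular expressions in xmlvocab.pl. The real file may look different.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical sample lines. The layout is assumed, not copied from neocl.xml.
my @sample = (
    '<endoderm>',
    '<name nci-code = "C0000001">adenoma</name>',
    '<name nci-code = "C0000001">adenomatous tumor</name>',
    '</endoderm>',
    '<mesoderm>',
    '<name nci-code = "C0000001">adenoma</name>',
    '<name nci-code = "C0000002">fibroma</name>',
    '</mesoderm>',
);

# Returns a list of problem messages for the given lines.
sub check_lines {
    my @lines = @_;
    my (%class_of, %seen, @problems);
    my $class = '';
    for my $line (@lines) {
        # A bare lowercase opening tag starts a new class.
        if ($line =~ /^<([a-z_]+)>/) { $class = $1; next; }
        # Only lines carrying a C or S code followed by 7 digits are checked.
        next unless $line =~ /([CS])([0-9]{7})/;
        my ($prefix, $code) = ($1, "$1$2");
        # Check 2: a code must not appear in more than one class.
        if (exists $class_of{$code} && $class_of{$code} ne $class) {
            push @problems, "$code appears in $class_of{$code} and $class";
        }
        $class_of{$code} //= $class;
        # Check 3: a term (keyed with its code prefix) must not repeat.
        next unless $line =~ /">\s*(.+?)\s*<\//;
        my $phrase = $prefix . $1;
        push @problems, "duplicate term \"$1\"" if $seen{$phrase}++;
    }
    return @problems;
}

print "$_\n" for check_lines(@sample);
```

Run as written, the sketch should report two problems: code C0000001 appears in both endoderm and mesoderm, and the term adenoma occurs twice. As in the original script, a term is keyed together with its code prefix (C or S), so the same wording under a C code and an S code is not counted as a repeat.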

-Jules J. Berman
