Wednesday, July 4, 2007

Unclassified terms in the Neoplasm Classification

A gzipped version of the Developmental Lineage Classification and Taxonomy of Neoplasms is available for public download (decompress it with the gunzip utility).
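For readers who would rather decompress the download from within Perl than from the command line, the core module IO::Uncompress::Gunzip does the same job. This is only a minimal sketch: the file names in the usage comment are hypothetical (the name of the compressed file is not given here), so substitute the real ones.

```perl
use strict;
use warnings;
use IO::Uncompress::Gunzip qw(gunzip $GunzipError);

# Decompress a gzipped file into a plain file, much like "gunzip -c in > out".
sub decompress_file {
    my ($gz_path, $out_path) = @_;
    gunzip($gz_path => $out_path)
        or die "gunzip of $gz_path failed: $GunzipError\n";
    return $out_path;
}

# Run as: perl decompress.pl neocl.xml.gz neocl.xml   (hypothetical names)
decompress_file($ARGV[0], $ARGV[1]) if @ARGV == 2;
```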

The total number of included cancer-related terms exceeds 146,000.

Within the file, the list of classified neoplasm terms is followed by a list of unclassified cancer-related terms, all of which carry the same identifier, "C0000000".

This list of unclassified terms consists of general cancer terms that do not specify any particular neoplasm; overly specific terms that provide so-called pre-coordinated annotations related to terms contained elsewhere in the Classification; and valid terms that have not yet been added to the list of classified neoplasm terms.

Examples of non-specific cancer-related terms are:

-borderline tumor
-mucinous tumor
-blast crisis
-preinvasive carcinoma

Examples of overly specific terms are:

-squamous carcinoma of the nasal vestibule
-gastric non-hodgkin lymphoma of mucosa-associated lymphoid tissue
-primary primitive neuroectodermal tumor of the kidney
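Because every unclassified term carries the identifier "C0000000", the two parts of the file can be told apart with a simple pattern match. The sketch below tallies classified and unclassified terms and collects the unclassified ones. It assumes one term per line, a code of the form "C" plus seven digits, and the term text sitting between `">` and `</`. The sample lines are illustrative only: the attribute name "nci-code" and the code "C1234567" are assumptions, not taken from the file.

```perl
use strict;
use warnings;

# Tally classified and unclassified terms from lines of the Classification.
# Returns a hash ref of counts and an array ref of the unclassified terms.
sub tally_terms {
    my @lines = @_;
    my %count = (classified => 0, unclassified => 0);
    my @unclassified;
    for my $line (@lines) {
        # Skip lines that carry no code at all.
        next unless $line =~ /(C[0-9]{7})/;
        my $code = $1;
        # The term is the text between the opening and closing tags.
        next unless $line =~ /"> ?(.+?) ?<\//;
        my $term = $1;
        if ($code eq "C0000000") {
            $count{unclassified}++;
            push @unclassified, $term;
        }
        else {
            $count{classified}++;
        }
    }
    return (\%count, \@unclassified);
}

# Illustrative input: attribute name and the non-zero code are assumptions.
my @sample = (
    '<name nci-code = "C1234567">example classified term</name>',
    '<name nci-code = "C0000000">borderline tumor</name>',
    '<name nci-code = "C0000000">blast crisis</name>',
);
my ($count, $unclassified) = tally_terms(@sample);
print "classified: $count->{classified}\n";
print "unclassified: $count->{unclassified}\n";
print "  $_\n" for @$unclassified;
```

To run it against the real file, read the lines of neocl.xml into an array and pass that array to tally_terms in place of the sample.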

The terms that are currently unclassified and awaiting inclusion in the classified section were added by placing curated candidate terms in an external file and parsing them with a Perl script. The script checks whether each candidate term is already in the Classification and automatically assigns it a "C0000000" code if it is new. It is one of many "helper" scripts that the curator uses to facilitate growth of the Classification. It is shown here:

#This Perl script was created by Jules J. Berman and is entered
#into the Public Domain
#The software is provided "as is", without warranty of any kind,
#express or implied, including but not limited to the warranties
#of merchantability, fitness for a particular purpose and
#noninfringement. In no event shall the authors or copyright
#holders be liable for any claim, damages or other liability,
#whether in an action of contract, tort or otherwise, arising
#from, out of or in connection with the software or the use or
#other dealings in the software.
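use strict;
use warnings;

#Pass 1: record every term that is already in the Classification
open (my $in, "<", "neocl.xml") || die "Cannot open neocl.xml: $!";
my %doubhash;
while (my $line = <$in>)
  {
  next if ($line !~ /C[0-9]{7}/);
  #the term is the text between the opening tag and the closing tag
  next if ($line !~ /\"\> ?(.+?) ?\<\//);
  my $phrase = $1;
  $doubhash{$phrase} = "";
  }
close $in;

#Pass 2: write each new candidate term as an unclassified entry
open (my $cand, "<", "newneocl.txt") || die "Cannot open newneocl.txt: $!";
open (my $out, ">", "new.out") || die "Cannot open new.out: $!";
while (my $key = <$cand>)
  {
  chomp $key;
  next if ($key eq "");
  if (exists $doubhash{$key})
    {
    print "$key already exists\n";
    next;
    }
  #The original listing is garbled at this point. The element name comes
  #from the closing tag; the attribute name "nci-code" is an assumption.
  print $out "\<name nci-code \= \"C0000000\"\>";
  print $out "$key\<\/name\>\n";
  }
close $cand;
close $out;
exit;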

-Jules J. Berman
