Saturday, July 7, 2007

Synonymy in the Neoplasm Classification

The Developmental Lineage Classification and Taxonomy of Neoplasms contains (as of 7/7/07) 130,359 classified neoplasm terms distributed over 5,826 concepts, an average of about 22 terms per concept. An example is the set of synonymous terms for adenocarcinoma of the prostate:

prostate with adenoca
adenoca arising in prostate
adenoca involving prostate
adenoca arising from prostate
adenoca of prostate
adenoca of the prostate
prostate with adenocarcinoma
adenocarcinoma arising in prostate
adenocarcinoma involving prostate
adenocarcinoma arising from prostate
adenocarcinoma of prostate
adenocarcinoma of the prostate
adenocarcinoma arising in the prostate
adenocarcinoma involving the prostate
adenocarcinoma arising from the prostate
prostate with ca
ca arising in prostate
ca involving prostate
ca arising from prostate
ca of prostate
ca of the prostate
prostate with cancer
cancer arising in prostate
cancer involving prostate
cancer arising from prostate
cancer of prostate
cancer of the prostate
cancer arising in the prostate
cancer involving the prostate
cancer arising from the prostate
prostate with carcinoma
carcinoma arising in prostate
carcinoma involving prostate
carcinoma arising from prostate
carcinoma of prostate
carcinoma of the prostate
carcinoma arising in the prostate
carcinoma involving the prostate
carcinoma arising from the prostate
prostate adenoca
prostate adenocarcinoma
prostate ca
prostate cancer
prostate carcinoma
prostatic cancer
prostatic carcinoma
prostatic adenocarcinoma
prostate gland adenocarcinoma
adenocarcinoma of the prostate gland
adenocarcinoma of prostate gland
prostate gland carcinoma
carcinoma of the prostate gland
carcinoma of prostate gland

This kind of synonymy is needed if you want to implement autocoding software that will successfully capture all of the cancer terms included in a sampled text.
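
As a rough illustration, here is a minimal sketch of how a synonym list might drive an autocoder, assuming a hash that maps every synonym to its concept code. The code "C9999999", the sample sentence, and the subroutine name autocode are hypothetical placeholders; they are not taken from the Classification.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical concept code, used only for illustration.
my $code = "C9999999";

# A few of the synonyms listed above, all mapped to the same concept.
my %term_to_code = map { $_ => $code } (
    "adenocarcinoma of the prostate",
    "prostate cancer",
    "prostatic carcinoma",
    "ca of prostate",
);

# Return a hash of every listed term found in the text, with its code.
sub autocode
{
    my ($text) = @_;
    $text = lc $text;
    my %found;
    foreach my $term (keys %term_to_code)
    {
        # \b keeps a short term such as "ca of prostate" from
        # matching in the middle of a longer word.
        $found{$term} = $term_to_code{$term} if ($text =~ /\b\Q$term\E\b/);
    }
    return %found;
}

my %hits = autocode("Biopsy showed adenocarcinoma of the prostate.");
foreach my $term (sort keys %hits)
{
    print "$hits{$term}\t$term\n";
}
```

A report that says "prostatic carcinoma" and another that says "ca of prostate" would both receive the same code, which is the point of carrying so many synonyms per concept.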


In June, 2014, my book, entitled Rare Diseases and Orphan Drugs: Keys to Understanding and Treating the Common Diseases was published by Elsevier. The book builds the argument that our best chance of curing the common diseases will come from studying and curing the rare diseases.



I urge you to read more about my book. There's a generous preview of the book at the Google Books site. If you like the book, please request your librarian to purchase a copy of this book for your library or reading room.

- Jules J. Berman, Ph.D., M.D.

tags: common disease, orphan disease, orphan drugs, rare disease, medical terminology, medical transcription, nomenclature, terminology, pitfalls in medical terminology

Wednesday, July 4, 2007

Unclassified terms in the Neoplasm Classification

The gzipped version (decompress it with the gunzip utility) of the Developmental Lineage Classification and Taxonomy of Neoplasms is available for public download.

The total number of included cancer-related terms exceeds 146,000.

Following the list of classified neoplasm terms, within the same file, is a list of unclassified cancer-related terms, all of which carry the same identifier, "C0000000".

This list of unclassified terms consists of general cancer terms that do not specify any particular neoplasm; overly specific terms that provide so-called pre-coordinated annotations related to terms contained elsewhere in the Classification; and valid terms that have not yet been added to the list of classified neoplasm terms.

Examples of non-specific cancer-related terms are:

-borderline tumor
-mucinous tumor
-blast crisis
-preinvasive carcinoma
-dysplasia

Examples of overly specific terms are:

-squamous carcinoma of the nasal vestibule
-gastric non-hodgkin lymphoma of mucosa-associated lymphoid tissue
-primary primitive neuroectodermal tumor of the kidney
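
Because every unclassified term shares the identifier "C0000000", the unclassified entries can be separated from the classified ones with a single pattern match. The sketch below assumes that each term sits on one line between an opening tag that ends with the quoted code and a closing tag. The attribute name nci-code, the code C1234567, and the subroutine name unclassified_terms are assumptions made for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Sample lines in the assumed layout; the attribute name and the
# code C1234567 are illustrative, not copied from the file.
my @lines = (
    '<name nci-code = "C1234567">prostate cancer</name>',
    '<name nci-code = "C0000000">borderline tumor</name>',
    '<name nci-code = "C0000000">blast crisis</name>',
);

# Return the terms whose identifier is the shared unclassified code.
sub unclassified_terms
{
    my @terms;
    foreach my $line (@_)
    {
        # $1 is the code, $2 is the term between the tags.
        next unless ($line =~ /"(C[0-9]{7})"> ?(.+?) ?<\//);
        push(@terms, $2) if ($1 eq "C0000000");
    }
    return @terms;
}

print "$_\n" foreach unclassified_terms(@lines);
```

Run against the real file, the same loop would read lines from a filehandle instead of an array.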

The terms that are currently unclassified and are awaiting inclusion in the classified section were added by putting curated candidate terms in an external file and parsing these candidate terms with a Perl script that checks to see if they are already in the Classification and that automatically assigns them a "C0000000" code if they are new. The Perl script, addterm.pl, is one of many "helper" scripts that the curator uses to facilitate growth of the classification. It is shown here:


#!/usr/local/bin/perl
#addterm.pl
#
#This Perl script was created by Jules J. Berman and is entered
#into the Public Domain
#
#The software is provided "as is", without warranty of any kind,
#express or implied, including but not limited to the warranties
#of merchantability, fitness for a particular purpose and
#noninfringement. in no event shall the authors or copyright
#holders be liable for any claim, damages or other liability,
#whether in an action of contract, tort or otherwise, arising
#from, out of or in connection with the software or the use or
#other dealings in the software.
#
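use strict;
use warnings;

# Collect every term that is already in the Classification.
open (my $in, "<", "neocl.xml") || die "Cannot open neocl.xml: $!";
my %doubhash;
while (my $line = <$in>)
{
    next if ($line !~ /C[0-9]{7}/);
    # Capture the term that sits between the opening and closing tags.
    next unless ($line =~ /\"\> ?(.+) ?\<\//);
    $doubhash{$1} = "";
}
close $in;

# Check each curated candidate term against the Classification.
open (my $cand, "<", "newneocl.txt") || die "Cannot open newneocl.txt: $!";
open (my $out, ">", "new.out") || die "Cannot open new.out: $!";
while (my $key = <$cand>)
{
    chomp $key;
    next if ($key eq "");
    if (exists $doubhash{$key})
    {
        print "$key already exists\n";
    }
    else
    {
        # The attribute name nci-code is an assumption; match it to
        # the tags used in neocl.xml.
        print $out "<name nci-code = \"C0000000\">";
        print $out "$key</name>\n";
    }
}
close $cand;
close $out;
exit;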

-Jules J. Berman
Science is not a collection of facts. Science is what facts teach us; what we can learn about our universe, and ourselves, by deductive thinking. From observations of the night sky, made without the aid of telescopes, we can deduce that the universe is expanding, that the universe is not infinitely old, and why black holes exist. Without resorting to experimentation or mathematical analysis, we can deduce that gravity is a curvature in space-time, that the particles that compose light have no mass, that there is a theoretical limit to the number of different elements in the universe, and that the earth is billions of years old. Likewise, simple observations on animals tell us much about the migration of continents, the evolutionary relationships among classes of animals, why the nuclei of cells contain our genetic material, why certain animals are long-lived, why the gestation period of humans is 9 months, and why some diseases are rare and other diseases are common. In “Armchair Science”, the reader is confronted with 129 scientific mysteries, in cosmology, particle physics, chemistry, biology, and medicine. Beginning with simple observations, step-by-step analyses guide the reader toward solutions that are sometimes startling, and always entertaining. “Armchair Science” is written for general readers who are curious about science, and who want to sharpen their deductive skills.

Sunday, July 1, 2007

Neoplasm classification structural validation

In yesterday's post, I announced the newest version of the Developmental Lineage Classification and Taxonomy of Neoplasms (also called the Neoplasm Classification).

When you have a nomenclature that contains hundreds of thousands of terms, and when new versions of the nomenclature are regularly released, you need computational methods to check the internal consistency of the nomenclature. The classification is in XML, and this makes it easy to write a multi-purpose parsing script.

The Perl script (below) has three purposes:

1. It checks that neocl.xml is well-formed XML

2. It checks that a concept identifying code in one class is not repeated in any other class within neocl.xml

3. It checks that no term in neocl.xml is ever repeated

On my 2.5 GHz computer, the xmlvocab.pl Perl script takes about 4 seconds to parse and check the 10+ megabyte neocl.xml file. The script prints a message for each problem term it finds in the nomenclature.
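
Checks 2 and 3 each come down to a hash lookup, which can be illustrated on a handful of in-memory records before reading the full listing. This is only a sketch: the class names, codes, terms, and the subroutine name check_records are made up for illustration, and the records are given as ready-made triples rather than parsed from XML.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Each record is [class name, code, term]; all values are invented.
my @records = (
    [ "endoderm", "C1111111", "term one" ],
    [ "endoderm", "C1111111", "term two" ],    # same code, same class: fine
    [ "mesoderm", "C1111111", "term three" ],  # same code, other class: problem
    [ "mesoderm", "C2222222", "term one" ],    # term seen before: problem
);

# Return a list of messages, one for each problem found.
sub check_records
{
    my (%class_of, %seen, @problems);
    foreach my $record (@_)
    {
        my ($class, $code, $term) = @$record;
        # Check 2: a code may not appear in more than one class.
        if (exists $class_of{$code})
        {
            push(@problems, "$code is a problem") if ($class_of{$code} ne $class);
        }
        else
        {
            $class_of{$code} = $class;
        }
        # Check 3: no term may be repeated. The key combines the code's
        # first letter with the term, as the full script does.
        my $key = substr($code, 0, 1) . $term;
        push(@problems, "More than one occurrence of \"$term\"") if ($seen{$key}++);
    }
    return @problems;
}

print "$_\n" foreach check_records(@records);
```

The third and fourth records are the ones that trigger messages; the first two share a code within one class, which is exactly what synonyms are expected to do.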

#!/usr/bin/perl
#xmlvocab.pl
#
#This Perl script was created by Jules J. Berman and
#updated on 5/19/2005
#
#Copyright (c) 2005 Jules J. Berman
#
#Permission is granted to copy, distribute and/or
#modify this document
#under the terms of the GNU Free Documentation
#License, Version 1.2
#or any later version published by the Free
#Software Foundation;
#with no Invariant Sections, no Front-Cover Texts,
#and no Back-Cover Texts.
#
#The software is provided "as is", without warranty
#of any kind, express or implied, including but not
#limited to the warranties of merchantability,
#fitness for a particular purpose and
#noninfringement. in no event shall the authors
#or copyright holders be liable for any claim, damages
#or other liability, whether in an action of contract,
#tort or otherwise, arising from, out of or in connection
#with the software or the use or other dealings in the
#software.
#
#An explanation of the classification can be found in
#the following two publications, which should be cited
#in any publication or work that may result from any
#use of this file.
#
#Berman JJ. Tumor classification: molecular analysis
#meets Aristotle. BMC Cancer 4:8, 2004.
#
#neocl.xml is the classification of all neoplastic lesions.
#
use XML::Parser;
my $parser = XML::Parser->new( Handlers => {
Init => \&handle_doc_start,
Final => \&handle_doc_end,
});
$file = "neocl.xml";
$parser -> parsefile($file);

sub handle_doc_start
{
print "\nBeginning to parse $file now\n";
}

sub handle_doc_end
{
print "\nFinished. $file is a well-formed XML File.\n";
}

# First pass: check that each code belongs to only one class.
open (TEXT, $file) || die "Cannot open $file";
#open (OUT,">neocl.out");
my $countcode = 0;
my %code;
my $classname = "";
my $phrasecount = 0;
while (my $line = <TEXT>)
{
    next unless ($line =~ /\<.+\>/);
    # A line that begins with a bare lowercase tag opens a new class.
    if ($line =~ /^\<([a-z\_]+)\>/)
    {
        $classname = $1;
        next;
    }
    if ($line =~ /[CS]([0-9]{7})/)
    {
        $phrasecount++;
        if (exists $code{$&})
        {
            if ($code{$&} ne $classname)
            {
                print "$& is a problem\n";
            }
        }
        else
        {
            $code{$&} = $classname;
            $countcode++;
        }
    }
}
close TEXT;
print "The total number of concepts is $countcode\n";
print "The total number of phrases is $phrasecount\n";

# Second pass: check that no term is repeated.
open (TEXT, $file) || die "Cannot open $file";
undef %code;
my %item;
while (my $line = <TEXT>)
{
    if ($line =~ /([CS])([0-9]{7})/)
    {
        my $prefix = $1;
        # Skip lines that carry a code but no term between tags.
        next unless ($line =~ /\"\> ?(.+) ?\<\//);
        my $phrase = $prefix . $1;
        if (exists $item{$phrase})
        {
            print $. . " More than one occurrence of \"$phrase\"\n";
        }
        $item{$phrase}="";
    }
}
close TEXT;
exit;

-Jules J. Berman

tags: common disease, orphan disease, orphan drugs, rare disease, subsets of disease, disease genetics, genetics of complex disease, genetics of common diseases, cryptic disease