Friday, December 12, 2008

Latest version of Neoplasms Classification now available

I'm interrupting my series of blogs on the CDC public use mortality data sets to announce the release of the most recent version of the Developmental Lineage Classification and Taxonomy of Neoplasms.

The current classification contains 5986 neoplasm concepts (types of neoplasms) classified under 127,723 terms. It also contains a large number of unclassified neoplasm terms as addendum items. It is, by far and away, the world's largest neoplasm nomenclature.

The classification is available in XML, RDF and flat-file formats. Here is the preface text distributed with each formatted version:

"This file was prepared by Jules J. Berman. The first version of this file was created November 15, 2003. The current version was created on December 12, 2008.

Copyright (c) 2003-2008 Jules J. Berman

Permission is granted to copy, distribute and/or modify this document under the terms of the GNU Free Documentation License, Version 1.2 or any later version published by the Free Software Foundation; with no Invariant Sections, no Front-Cover Texts, and no Back-Cover Texts. A copy of the license is available.

The neoclxml file is provided "as is", without warranty of any kind, expressed or implied, including but not limited to the warranties of merchantability, fitness for a particular purpose and noninfringement. In no event shall the author or copyright holder be liable for any claim, damages or other liability, whether in an action of contract, tort or otherwise, arising from, out of or in connection with the software or the use or other dealings in the software.

An explanation of the classification can be found in the following two publications, which should be cited in any publication or work that may result from any use of this file.

Berman JJ. Tumor classification: molecular analysis meets Aristotle. BMC Cancer 4:8, 2004.

Berman JJ. Neoplasms: Principles of Development and Diversity. Jones and Bartlett Publishers, Sudbury, MA, 2009.

In the Neoplasm Classification, all classified names of neoplasms are coded with a "C" followed by a 7 digit number other than 0000000.

For example, "C9168000" = rectal signet ring adenocarcinoma

In addition to classified terms, there are three groups of unclassified terms, provided as special items that follow the list of classified terms in this file.

"C0000000"
"S" followed by 7 digits
"ST" followed by 7 digits

The list of unclassified terms coded as "C0000000" consists of general cancer terms that do not specify any particular neoplasm; overly specific terms that provide so-called pre-coordinated annotations related to terms contained elsewhere in the Classification; and valid terms that have not yet been added to the list of classified neoplasm terms.

Examples of non-specific cancer-related terms are:

borderline tumor
mucinous tumor
blast crisis
preinvasive carcinoma

Examples of overly specific terms are:

squamous carcinoma of the nasal vestibule
gastric non-hodgkin lymphoma of mucosa-associated lymphoid tissue
primary primitive neuroectodermal tumor of the kidney

The terms that are coded "S" followed by 7 digits are inherited syndromes that have a neoplastic component (i.e., the occasional or frequent appearance of neoplasms in the syndrome).

The terms that are coded "ST" followed by 7 digits are staging terms used by oncologists.

The classification is meant for informatics projects that use computer parsing techniques. Programmers should simply insert statements that filter the unclassified terms included in the file."
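
As a rough illustration of the filtering step the preface mentions, here is a minimal Python sketch. It assumes the codes and terms have already been pulled out of the file as (code, term) pairs; how you extract them depends on which format you download. The function names and the sample "S" and "ST" codes are made up for illustration and are not taken from the file.

```python
import re

# Code patterns as described in the preface.
CLASSIFIED = re.compile(r"C(?!0000000)\d{7}")  # "C" + 7 digits, but not C0000000
SYNDROME = re.compile(r"S\d{7}")               # inherited syndromes
STAGING = re.compile(r"ST\d{7}")               # staging terms


def code_group(code):
    """Return the group a code belongs to (hypothetical helper)."""
    if code == "C0000000":
        return "unclassified"
    if CLASSIFIED.fullmatch(code):
        return "classified"
    if SYNDROME.fullmatch(code):
        return "syndrome"
    if STAGING.fullmatch(code):
        return "staging"
    return "unknown"


def classified_only(pairs):
    """Keep only the (code, term) pairs that carry a classified code."""
    return [(code, term) for code, term in pairs
            if code_group(code) == "classified"]


# Sample pairs: the first two come from the preface; the "S" and "ST"
# codes and their terms are placeholders, not real entries.
sample = [
    ("C9168000", "rectal signet ring adenocarcinoma"),
    ("C0000000", "borderline tumor"),
    ("S0000001", "placeholder syndrome term"),
    ("ST0000001", "placeholder staging term"),
]

for code, term in classified_only(sample):
    print(code, term)
```

The same test can be inverted if a project needs the syndrome or staging terms instead of the classified neoplasm names.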

Additional information may be available from the author's web site:

The Neoplasm Classification is available as a gzipped XML file at:

The Neoplasm Classification is available as a zipped XML file at:

The Neoplasm Classification is available as a gzipped flat file at:

The Neoplasm Classification is available as an ascii flat-file at:

The Neoplasm Classification is available as a gzipped RDF file at:

The Neoplasm Classification is available as a zipped RDF file at:

© 2008 Jules Berman