Wednesday, January 28, 2009

Update of Neoplasm Classification is now available

I'm interrupting my series of blogs on bimodal cancer age distributions to announce the release of the most recent version of the Developmental Lineage Classification and Taxonomy of Neoplasms.

The current classification contains 122,698 terms classified under 6,083 neoplasm concepts (types of neoplasms). It also contains a large number of unclassified neoplasm terms as addendum items. It is, far and away, the world's largest neoplasm nomenclature.

The classification is available in XML, RDF and flat-file formats. Here is the preface text distributed with each formatted version:

"This file was prepared by Jules J. Berman. The first version of this file was created November 15, 2003. The current version was created on January 27, 2009.

Copyright © 2003-2009 Jules J. Berman

Permission is granted to copy, distribute and/or modify this document under the terms of the GNU Free Documentation License, Version 1.2 or any later version published by the Free Software Foundation; with no Invariant Sections, no Front-Cover Texts, and no Back-Cover Texts. A copy of the license is available.

The neoclxml file is provided "as is", without warranty of any kind, expressed or implied, including but not limited to the warranties of merchantability, fitness for a particular purpose and noninfringement. In no event shall the author or copyright holder be liable for any claim, damages or other liability, whether in an action of contract, tort or otherwise, arising from, out of or in connection with the software or the use or other dealings in the software.

An explanation of the classification can be found in the following two publications, which should be cited in any publication or work that may result from any use of this file.

Berman JJ. Tumor classification: molecular analysis meets Aristotle. BMC Cancer 4:8, 2004.

Berman JJ. Neoplasms: Principles of Development and Diversity. Jones and Bartlett Publishers, Sudbury, MA, 2009.

In the Neoplasm Classification, all classified names of neoplasms are coded with a "C" followed by a 7-digit number other than 0000000.

For example, "C9168000" = rectal signet ring adenocarcinoma

In addition to classified terms, there are three groups of unclassified terms, provided as special items that follow the list of classified terms in this file:

"C0000000"
"S" followed by 7 digits
"ST" followed by 7 digits

The list of unclassified terms coded as "C0000000" consists of general cancer terms that do not specify any particular neoplasm; overly specific terms that provide so-called pre-coordinated annotations related to terms contained elsewhere in the Classification; and valid terms that have not (yet) been added to the list of classified neoplasm terms.

Examples of overly specific terms are:

squamous carcinoma of the nasal vestibule, gastric non-hodgkin lymphoma of mucosa-associated lymphoid tissue, primary primitive neuroectodermal tumor of the kidney

The terms that are coded "S" followed by 7 digits are inherited syndromes that have a neoplastic component (i.e., the occasional or frequent appearance of neoplasms in the syndrome).

The terms that are coded "ST" followed by 7 digits are staging terms used by oncologists.

The classification is meant for informatics projects that use computer parsing techniques. Programmers should simply insert statements that filter the unclassified terms included in the file."
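The filtering step mentioned at the end of the preface might look something like the following minimal Python sketch. It assumes each term has already been read from the file and paired with its code. The function names, the category labels, and the sample "S" and "ST" entries are invented for illustration and are not part of the distributed files.

```python
import re

# Code patterns taken from the preface; the category labels are illustrative.
CLASSIFIED = re.compile(r"C\d{7}")
SYNDROME = re.compile(r"S\d{7}")
STAGING = re.compile(r"ST\d{7}")


def code_category(code):
    """Return a label for a Neoplasm Classification code."""
    if code == "C0000000":
        return "unclassified"
    if CLASSIFIED.fullmatch(code):
        return "classified"
    if SYNDROME.fullmatch(code):
        return "syndrome"
    if STAGING.fullmatch(code):
        return "staging"
    return "unknown"


def classified_only(pairs):
    """Keep only (code, term) pairs whose code marks a classified neoplasm."""
    return [(code, term) for code, term in pairs
            if code_category(code) == "classified"]


# Hypothetical sample; the last two entries are placeholders, not real terms.
sample = [
    ("C9168000", "rectal signet ring adenocarcinoma"),
    ("C0000000", "squamous carcinoma of the nasal vestibule"),
    ("S0000001", "hypothetical inherited syndrome"),
    ("ST0000001", "hypothetical staging term"),
]
print(classified_only(sample))
```

Checking for "C0000000" before the general "C" pattern matters, because the all-zero code would otherwise pass as a classified term.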

Additional information may be available from the author's web site:
http://www.julesberman.info/devclass.htm

The Neoplasm Classification is available as a zipped XML file at:
http://www.julesberman.info/neoclxml.zip

The Neoplasm Classification is available as a zipped flat file at:
http://www.julesberman.info/neoself.zip

The Neoplasm Classification is available as a zipped RDF file at:
http://www.julesberman.info/neordf.zip

© 2009 Jules Berman

key words: medical nomenclature, classification, rdf, xml, ontology, data mining, science
Science is not a collection of facts. Science is what facts teach us; what we can learn about our universe, and ourselves, by deductive thinking. From observations of the night sky, made without the aid of telescopes, we can deduce that the universe is expanding, that the universe is not infinitely old, and why black holes exist. Without resorting to experimentation or mathematical analysis, we can deduce that gravity is a curvature in space-time, that the particles that compose light have no mass, that there is a theoretical limit to the number of different elements in the universe, and that the earth is billions of years old. Likewise, simple observations on animals tell us much about the migration of continents, the evolutionary relationships among classes of animals, why the nuclei of cells contain our genetic material, why certain animals are long-lived, why the gestation period of humans is 9 months, and why some diseases are rare and other diseases are common. In “Armchair Science”, the reader is confronted with 129 scientific mysteries, in cosmology, particle physics, chemistry, biology, and medicine. Beginning with simple observations, step-by-step analyses guide the reader toward solutions that are sometimes startling, and always entertaining. “Armchair Science” is written for general readers who are curious about science, and who want to sharpen their deductive skills.

Friday, January 9, 2009

Medical importance of bimodal cancers

In my January 2 blog, I introduced the subject of bimodal cancers. These are cancers that have two peaks in occurrences, by age. In the blog, I included images of the type of age-distribution graphs seen with bimodal and multimodal cancers.

Examples of recognized bimodal cancers are Hodgkin lymphoma (which has two peaks in occurrence: in young adults and in middle-aged adults), and Kaposi's sarcoma (which has two peaks in occurrence: in young people, with AIDS, and in older men, unassociated with AIDS).

The shape of the curve of cancer occurrences, by age, for the different types of cancer, is a fascinating puzzle. If we understand why some cancer curves are bimodal, we can enhance our knowledge of carcinogenesis (the developmental process of cancer) and tumor diagnosis (the features that identify a cancer and that separate a particular type of cancer from all other types of cancer). We can also learn a lot about the meaning of the data that we collect on cancers, and the ways that this data can be analyzed. Most importantly, the insights gained can save lives, by uncovering preventable cancers, and by finding new classes and subclasses of cancer that may benefit from innovative cancer treatments.

Here are the possible causes of cancer multimodality (multiple peaks in a graph of cancer occurrences by age):

1. Multiple environmental causes targeting different ages
2. Multiple genetic causes with different latencies
3. Multiple diseases classified under one name
4. Faulty or insufficient data
5. Combinations of 1, 2, 3 and 4

We see examples of all of these possibilities, in the SEER data, and in previously published studies of specific tumors.

We know that specific exposures to a site-specific carcinogen can create spikes in the occurrence of cancers in a particular subpopulation. For example, high-school boys who play baseball sometimes chew tobacco. It helps them maintain focus on their game, and it gives them something to do when they're sitting in the batter's cage. They typically have a favorite spot in their mouths, between the cheek and the gum, where they stick their "chaw". This is the most likely spot for cancer to occur. Cancers caused by chewing tobacco may occur in teen-agers and young adults. A specific type of high-risk behavior, such as tobacco chewing, can create an early peak in incidence for a tumor that normally occurs in a much older age group.

Some tumors have genetic and non-genetic (sporadic) causes. The best-studied example is probably retinoblastoma. Some people are born with mutations that predispose them to develop retinoblastoma. These people typically develop tumors at a very early age. Those who develop retinoblastoma without the inborn genetic mutation [who acquire mutations later in life] typically develop retinoblastoma at a later age.

We also see multimodal distributions when we mistakenly call several different kinds of cancer by the same name. For example, lung cancer in young persons may have a specific mutation that distinguishes it from lung cancer occurring in an older population (midline carcinoma of children and young adults has a characteristic gene rearrangement involving the NUT gene). This cancer is separable from bronchogenic carcinoma of the lung, which occurs in older persons. It may turn out that lung cancer of the young responds to a different treatment than lung cancers caused by smoking.

Finally, we must consider that it is possible that the multimodal curves are simply an artifact produced by the way we collect and analyze data. If the pathologists who rendered the diagnoses, used in the SEER data set, were wrong (i.e., rendered misdiagnoses), we would expect multimodality on that basis (representing the different tumors included under a category that should have included only one kind of cancer).

This actually happens. The best example is malignant fibrous histiocytoma. Current thinking is that this diagnostic entity has been used as a grab-bag diagnosis for sarcomas that do not fit well into any particular category. There is substantial evidence that many cases of malignant fibrous histiocytoma would have been better diagnosed as leiomyosarcomas, liposarcomas, fibrosarcomas, or any of a host of rare sarcomas, each with its own characteristic age distribution. By blending these different tumors under a single name, you also blend the age distributions of the reported population.
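To see how blending can produce extra peaks, here is a small Python sketch. The two age distributions are invented for illustration (they are not SEER data), and `count_peaks` is a hypothetical helper that simply counts strict local maxima in a list of counts per age group.

```python
def count_peaks(counts):
    """Count strict local maxima among the interior points of a list of counts."""
    peaks = 0
    for i in range(1, len(counts) - 1):
        if counts[i - 1] < counts[i] > counts[i + 1]:
            peaks += 1
    return peaks


# Invented occurrences per decade of age (0-9, 10-19, ..., 80-89).
tumor_a = [1, 4, 9, 4, 1, 0, 0, 0, 0]   # single peak in young adults
tumor_b = [0, 0, 0, 1, 3, 6, 10, 6, 2]  # single peak in older adults

# Reporting both tumors under one name adds their counts, age group by age group.
blended = [a + b for a, b in zip(tumor_a, tumor_b)]

print(count_peaks(tumor_a), count_peaks(tumor_b), count_peaks(blended))
# prints: 1 1 2
```

Each invented tumor has one peak on its own, but the combined counts have two. Real data are noisier, so a strict local-maximum count like this would need smoothing before it could be trusted.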

I prepared a document of bimodal cancer distributions (raw data, normalized data, and graphs). In this document, data on each tumor of a given name was collected, without pre-stratifying tumors based on gender, ethnicity, or anatomic site. Had we done so, we might have found that what we thought was a single tumor may have contained several different tumors (e.g., medullary carcinoma of breast and medullary carcinoma of thyroid). The artifactual aggregation of different tumors under a single name by ignoring well-known distinguishing demographic or anatomic factors, is a potential source of confusion. In later blogs, we'll see some simple ways of eliminating obvious sources of error from our analyses of bimodal populations.

Whitley and Ball have discussed a number of reasons, related to the collection of data, for multimodal peaks.

Whitley E, Ball J. Statistics review 1: Presenting and summarising data. Crit Care 6:66-71, 2002.

"a (bimodal)distribution with two peaks may actually be a combination of two uni-modal distributions (such as hormone levels in men and women). Alternatively, a (multimodal) distribution with multiple peaks may be due to digit preference (rounding observations up or down) during data collection, where peaks appear at round numbers, for example peaks in systolic blood pressure at 90, 100, 110, 120 mmHg, and so on."

Despite these considerations, there are many reasons to believe that many of the bimodal distributions found in the SEER data sets reveal true biological features of the cancer populations.

Reasons why the SEER bimodal graphs are non-artifactual

1. The multimodal peaks are rare among cancers. Of the more than 650 cancers collected in the complete file of cancer occurrences by age, only a couple dozen show multimodality. If there were a consistent error in the way that data were collected, would you not expect to see the same error in the majority of cancer distributions?

2. The SEER data reproduces multimodal peaks in the same cancers for which multimodal peaks have been established from other data sources. For example, the SEER data shows bimodal peaks for Hodgkin lymphoma, Kaposi sarcoma, and secretory carcinoma of the breast.

3. The SEER data provides very large numbers of cases for many of the cancers for which bimodal peaks are found. The shape of the curves cannot be attributed to sparse data, in these cases.

4. As we will see in future blog posts, when we examine the standard deviation of the bimodal peaks, and their modes, statistical analysis rejects the null hypothesis (that the observations can be accounted for by a single population).

5. We will also see that there is internal consistency of the observation of multimodality within the SEER data. In some cases, data is collected, within SEER, on a single tumor, under different names (for example the borderline tumors of the ovary are listed under several closely related terms, as are craniopharyngiomas). In these cases, multimodality is preserved among the same type of cancer, even when the data is collected under different terms.

The persistent message is that multimodality in a cancer distribution is a puzzle worth investigating.

-© 2009 Jules J. Berman

In June, 2014, my book, entitled Rare Diseases and Orphan Drugs: Keys to Understanding and Treating the Common Diseases was published by Elsevier. The book builds the argument that our best chance of curing the common diseases will come from studying and curing the rare diseases.



I urge you to read more about my book. There's a generous preview of the book at the Google Books site. If you like the book, please request your librarian to purchase a copy of this book for your library or reading room.

tags: orphan disease, orphan drugs, rare disease, disease genetics, genetics, bimodal, epidemiology, neoplasms, seer, pathogenesis, subsets of disease

Monday, January 5, 2009

Corrections site for Neoplasms book

For any readers who have a copy of my book, Neoplasms: Principles of Development and Diversity, I've made a web site devoted to corrections for the book.

So far, all of the corrections are small. If you find any additional errors, please notify me through a blog comment.

- Jules Berman

Friday, January 2, 2009

Cancers with two peaks in age distribution

To the casual observer, it doesn't make much difference whether a cancer has two peaks in its age distribution or one peak. I'm devoting several blog posts to trying to convince readers that the distinction is very important, often implying that what we thought was one type of cancer is actually two different cancers, with similar morphology but with distinctive clinical features and different methods of treatment. Furthermore, several dozen of these cancers with two peaks in their age distribution would not be discernible without a large cancer data set, such as the public use data files provided by the SEER (the U.S. National Cancer Institute's Surveillance, Epidemiology and End Results) project.

For this discussion, I've uploaded two PDF files.

The first file is intended to be a resource for pathologists, epidemiologists and cancer researchers. It contains about 650 neoplasms, each with its age distribution. Details of the graphic representations of the data are available in the file.

http://www.julesberman.info/seerdist.pdf


Most tumors have a simple, smooth age distribution, with a single peak. The graph of cancer rates (not occurrences), normalized against the age distribution of a standard U.S. population, often produces the highest rates at the upper age range (because cancers in the elderly occur within a relatively small population of superannuated individuals).


A typical cancer with one peak visible on a graph of occurrences (top) and of rates of occurrence normalized against a standard U.S. population (bottom).
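The difference between occurrences and normalized rates can be sketched with a few lines of Python. This is one plausible reading of the normalization (cases divided by the number of people in each age group). The exact method used for the graphs is described in the PDF file, and every number below is invented for illustration, not taken from SEER or census figures.

```python
# Invented numbers for illustration only; not SEER or census figures.
age_groups = ["40-49", "50-59", "60-69", "70-79", "80+"]
occurrences = [50, 120, 200, 180, 90]
# Hypothetical standard population, in the same age groups.
standard_population = [40000, 35000, 25000, 15000, 5000]


def rates_per_100k(cases, population):
    """Divide cases by population in each age group, scaled to 100,000 people."""
    return [100000 * c / p for c, p in zip(cases, population)]


rates = rates_per_100k(occurrences, standard_population)
peak_by_occurrence = age_groups[occurrences.index(max(occurrences))]
peak_by_rate = age_groups[rates.index(max(rates))]
print(peak_by_occurrence, peak_by_rate)
# prints: 60-69 80+
```

With these made-up figures, the raw counts peak in the 60-69 group, while the rate keeps climbing into the oldest group, because the few cases there are divided by a much smaller population.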

Not all cancers have a single age-of-occurrence peak. Some have two or more peaks of occurrence.

I've provided a file that lists cancers with multimodal age distributions (i.e., more than one peak in the age distribution for the neoplasm):

http://www.julesberman.info/bimode.pdf

There are about two dozen such cancers (out of about 650 listed in the seerdist.pdf file). Here are two sample pages from the bimode.pdf file:


Cancers with two peaks. Pairs of graphs are 1) occurrences by age and 2) normalized rates.

By examining these two files and by correlating these observations with known features of the included cancers, we can draw important conclusions about neoplasms in general and the bimodal tumors, in particular.


-© 2009 Jules J. Berman

tags: common disease, orphan disease, orphan drugs, genetics of disease, disease genetics, rules of disease biology, rare disease, pathology, epidemiology, neoplasms

Thursday, January 1, 2009

Updated and new files on neoplasm occurrences, by age

Happy New Year!

I've just uploaded a new version of my previously published file on the age distribution of occurrences for 626 different types of cancers.

http://www.julesberman.info/seerdist.pdf

This file is intended to be a resource for pathologists, epidemiologists and cancer researchers.

I've also uploaded a new file on cancers with multimodal age distributions (i.e., more than one peak in the age distribution for the neoplasm).

http://www.julesberman.info/bimode.pdf

I'll be discussing this file in the next several blog posts.

-© 2009 Jules Berman