Saturday, November 22, 2008

Using SEER Public-Use Data: 8

I have been writing a series of blog posts on SEER, the U.S. National Cancer Institute's Surveillance, Epidemiology and End Results program. SEER is an amazing resource for information on the cancers that occur in the U.S. One of the products of SEER is the Public Use dataset, which contains de-identified records on over 3.5 million cancers that occurred between 1973 and 2005.

When you have 3.5 million cancer cases to study, you can draw certain types of inferences that could not possibly be made with the data accumulated at a single medical institution.

Today, we'll look at the neoplasms that occur in the appendix.

Here are the SEER listings. The left-hand column is the average age of occurrence of each neoplasm. The middle column is the number of cases in the SEER collection (neoplasms with fewer than 20 SEER cases were considered un-informative and were omitted from the list). The column on the right is the ICD-O term.

Age Cases Neoplasm name
039 022 carcinoid tumor, argentaffin, malignant
040 434 carcinoid tumor, malignant
050 217 adenocarcinoid tumor
054 224 mucocarcinoid tumor, malignant
055 038 composite carcinoid
058 022 carcinoma nos
058 138 signet ring cell carcinoma
059 186 mucin-producing adenocarcinoma
060 640 mucinous adenocarcinoma
061 175 mucinous cystadenocarcinoma nos
063 649 adenocarcinoma nos
065 033 adenocarcinoma in tubulovillous adenoma
067 063 adenocarcinoma in villous adenoma

Though many different kinds of malignant neoplasms can occur in the appendix (and can be found in the SEER data), only carcinoids and adenocarcinomas occur frequently.

All of the carcinoid tumors cluster at a younger average age of occurrence (39 to 55 years) than the adenocarcinomas and other carcinomas (58 to 67 years).
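The size of the age gap can be checked directly from the listing. The sketch below is in Python, not the Perl used for this post, and it makes two assumptions that are not in the original analysis: it groups terms by whether the name contains "carcinoid", and it pools the groups with a case-weighted mean of the already rounded per-term ages, so the results are approximate.

```python
# (average age, cases, ICD-O term) rows copied from the listing above.
rows = [
    (39, 22, "carcinoid tumor, argentaffin, malignant"),
    (40, 434, "carcinoid tumor, malignant"),
    (50, 217, "adenocarcinoid tumor"),
    (54, 224, "mucocarcinoid tumor, malignant"),
    (55, 38, "composite carcinoid"),
    (58, 22, "carcinoma nos"),
    (58, 138, "signet ring cell carcinoma"),
    (59, 186, "mucin-producing adenocarcinoma"),
    (60, 640, "mucinous adenocarcinoma"),
    (61, 175, "mucinous cystadenocarcinoma nos"),
    (63, 649, "adenocarcinoma nos"),
    (65, 33, "adenocarcinoma in tubulovillous adenoma"),
    (67, 63, "adenocarcinoma in villous adenoma"),
]


def pooled_mean_age(group):
    """Case-weighted mean of the per-term average ages in a group."""
    total_cases = sum(cases for _, cases, _ in group)
    return sum(age * cases for age, cases, _ in group) / total_cases


# Assumption: a term belongs to the carcinoid group if its name says so.
carcinoids = [r for r in rows if "carcinoid" in r[2]]
carcinomas = [r for r in rows if "carcinoid" not in r[2]]

print(f"carcinoids: {len(carcinoids)} terms, pooled mean age {pooled_mean_age(carcinoids):.1f}")
print(f"carcinomas: {len(carcinomas)} terms, pooled mean age {pooled_mean_age(carcinomas):.1f}")
```

On the listed rows, the pooled averages come out at roughly 46 years for the carcinoid group and roughly 61 years for the carcinoma group. No carcinoid entry has an average age above 55, and no carcinoma entry has one below 58.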

This tells us a few things:

1. All of the carcinoids are biologically related to each other.

2. The carcinoids have a different developmental history than the adenocarcinomas.

3. When a pathologist sees a focus of adenocarcinoma in an appendiceal tumor, particularly in a young or middle-aged patient, he or she should carefully look for a focus of carcinoid, because the tumor might be a mixed adenocarcinoid tumor.

This is an example of how to use the SEER data to examine and test existing hypotheses and to develop new hypotheses. It took under a minute to generate the table, using a Perl script that parsed through 3.5 million SEER records.
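For readers who want to try this kind of tabulation, here is a minimal sketch of the aggregation step. It is in Python, not Perl, and it is not the script used for this post. The function names, the (term, age) input shape, and the choice to round the average age are all assumptions made for illustration; extracting the age, site, and histology fields from the raw SEER records is left out because it depends on the file layout documented with the data.

```python
from collections import defaultdict


def summarize_by_term(records, min_cases=20):
    """Return (rounded mean age, case count, term) rows sorted by age.

    `records` is any iterable of (term, age_in_years) pairs, assumed to
    have been extracted already from the records for one anatomic site.
    """
    totals = defaultdict(lambda: [0, 0])  # term -> [sum of ages, case count]
    for term, age in records:
        totals[term][0] += age
        totals[term][1] += 1

    rows = [
        (round(age_sum / count), count, term)
        for term, (age_sum, count) in totals.items()
        if count >= min_cases  # omit terms with too few cases to be informative
    ]
    return sorted(rows)


def format_row(mean_age, count, term):
    """Format one row with zero-padded columns, like the listing above."""
    return f"{mean_age:03d} {count:03d} {term}"


# Tiny synthetic input (not real SEER data) to exercise the functions.
demo = [("term a", 40)] * 25 + [("term b", 60)] * 30 + [("term c", 50)] * 5
for row in summarize_by_term(demo):
    print(format_row(*row))
```

Because the function consumes one record at a time and keeps only a running sum and count per term, it can process a file of several million records in a single pass without loading them all into memory.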

In a prior blog post, I discussed some of the simple Perl routines used in the SEER-data parsing algorithms, and these are available from my web site.

If you want to do creative data mining, you will need to learn a little computer programming.

For Perl and Ruby programmers, methods and scripts for using SEER and other publicly available biomedical databases are described in detail in my prior books:

Perl Programming for Medicine and Biology

Ruby Programming for Medicine and Biology

An overview of the many uses of biomedical information is available in my book, Biomedical Informatics.

More information on cancer is available in my recently published book, Neoplasms.

