Wednesday, November 19, 2008

Using SEER Public-Use Data: 6

I have been writing a series of blog posts on SEER, the U.S. National Cancer Institute's Surveillance, Epidemiology and End Results program. SEER is an amazing resource for information on the cancers that occur in the U.S. One of the products of SEER is the Public Use dataset, which contains de-identified records on over 3.5 million cancers that occurred between 1973 and 2005.

When you have 3.5 million cancer cases to study, you can draw certain types of inferences that could not possibly be made with the data accumulated at a single medical institution.

Here is an example. Precancers are the identifiable, easily treated lesions from which advanced cancers develop. When you eradicate the precancer, the cancer never develops.

In theory, if precancers precede cancers, the average age at diagnosis of precancers should be lower than the average age at diagnosis of the cancers that develop from them. This biologic tautology has been hard to verify, because not all precancers develop in people of the same age. The same is true for cancers. And it's hard to come up with a population large enough to separate the (overlapping) precancer and cancer populations.

However, the SEER population provides the numbers we need. When we extract all of the SEER neoplasms that arise in the uterine cervix, we find the following average ages for the resulting set of tumors:

Age  Count  Neoplasm name
034 0049176 carcinoma in situ nos
034 0051906 squamous cell carcinoma in situ nos
035 0000359 sq cell carcinoma lg cell non-ker in situ
037 0000313 sq cell carcinoma keratinizing nos in situ
039 0001348 adenocarcinoma in situ
039 0018551 squamous intraepithelial neoplasia, grade iii
039 0000058 squamous cell carcinoma in situ with questionable stromal invasion
041 0003213 squamous cell carcinoma, microinvasive
043 0000113 adenocarcinoma, endocervical type
048 0001320 adenosquamous carcinoma
048 0000049 neuroendocrine carcinoma
048 0000093 large cell carcinoma nos
049 0003118 squamous cell carcinoma, large cell, nonkeratinizing type
050 0002524 carcinoma nos
050 0000259 endometrioid carcinoma
050 0000021 mucinous adenocarcinoma, endocervical type
051 0004121 adenocarcinoma nos
051 0000250 small cell carcinoma nos
051 0002727 squamous cell carcinoma, keratinizing type nos
051 0000104 squamous cell carcinoma, small cell, nonkeratinizing type
052 0000233 mucinous adenocarcinoma
052 0018774 squamous cell carcinoma nos
054 0000218 clear cell adenocarcinoma nos
055 0000088 papillary squamous cell carcinoma
056 0000023 mesodermal mixed tumor
056 0000141 mucin-producing adenocarcinoma
058 0000034 verrucous carcinoma nos
060 0000289 neoplasm, malignant
060 0000037 papillary serous cystadenocarcinoma
062 0000025 sarcoma nos
062 0000055 mullerian mixed tumor
062 0000044 carcinoma, anaplastic type nos
062 0000076 carcinoma, undifferentiated type nos
062 0000033 adenocarcinoma with squamous metaplasia
063 0000043 carcinosarcoma nos

Every in situ lesion (i.e., non-invasive precancer) in the table has a lower average age at diagnosis (34 to 39 years) than every invasive cancer arising from the cervix (41 years and up)! I have never seen this observation demonstrated from any other data set.
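The separation can be checked directly from the table. Here is a minimal Python sketch that parses rows in the table's "age count name" layout and compares the oldest in situ average with the youngest invasive average. The function names and the rule used to flag a precancer (the name contains "in situ" or "intraepithelial") are my assumptions for illustration. Only the first ten rows of the table are included; because the table is sorted by age, every remaining row is an invasive tumor with a higher average age.

```python
def parse_row(row):
    """Split an 'age count name' row into (age, count, name)."""
    age_text, count_text, name = row.split(maxsplit=2)
    return int(age_text), int(count_text), name


def is_in_situ(name):
    """Assumed rule: in situ or intraepithelial lesions count as precancers."""
    return "in situ" in name or "intraepithelial" in name


# The first ten rows of the table above, copied verbatim.
rows = [
    "034 0049176 carcinoma in situ nos",
    "034 0051906 squamous cell carcinoma in situ nos",
    "035 0000359 sq cell carcinoma lg cell non-ker in situ",
    "037 0000313 sq cell carcinoma keratinizing nos in situ",
    "039 0001348 adenocarcinoma in situ",
    "039 0018551 squamous intraepithelial neoplasia, grade iii",
    "039 0000058 squamous cell carcinoma in situ with questionable stromal invasion",
    "041 0003213 squamous cell carcinoma, microinvasive",
    "043 0000113 adenocarcinoma, endocervical type",
    "048 0001320 adenosquamous carcinoma",
]

parsed = [parse_row(r) for r in rows]
in_situ_ages = [age for age, _, name in parsed if is_in_situ(name)]
invasive_ages = [age for age, _, name in parsed if not is_in_situ(name)]

oldest_in_situ = max(in_situ_ages)
youngest_invasive = min(invasive_ages)
print(oldest_in_situ, youngest_invasive)
```

If the oldest in situ average is below the youngest invasive average, the two groups do not overlap in the table.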

It took about one minute to generate the table, using a Perl script that parsed through 3.5 million SEER records.
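The aggregation behind a table like this is simple. Here is a minimal sketch in Python (it is not the Perl script used for this post). It assumes each record has already been reduced to a line of the hypothetical form `age|neoplasm name`; the real SEER files are fixed-width, and their column positions are not reproduced here. The function names, the pipe-delimited layout, and the choice to round the average are all assumptions made for illustration.

```python
from collections import defaultdict


def average_age_by_name(lines):
    """Return {name: (average_age, count)} from 'age|name' lines.

    The 'age|name' layout is a hypothetical, simplified stand-in for
    fields extracted from SEER records; it is not the SEER file format.
    """
    totals = defaultdict(lambda: [0, 0])  # name -> [sum of ages, count]
    for line in lines:
        age_text, _, name = line.strip().partition("|")
        name = name.strip().lower()
        if not age_text.isdigit() or not name:
            continue  # skip blank or malformed lines and unknown ages
        totals[name][0] += int(age_text)
        totals[name][1] += 1
    return {name: (s / n, n) for name, (s, n) in totals.items()}


def format_table(stats):
    """Format rows as zero-padded 'age count name', sorted by average age."""
    rows = sorted(stats.items(), key=lambda item: (item[1][0], item[0]))
    return [f"{round(avg):03d} {count:07d} {name}" for name, (avg, count) in rows]


# Made-up sample records, for illustration only (not SEER data).
sample = [
    "30|carcinoma in situ nos",
    "38|carcinoma in situ nos",
    "50|adenocarcinoma nos",
    "52|adenocarcinoma nos",
    "54|adenocarcinoma nos",
]
table = format_table(average_age_by_name(sample))
for row in table:
    print(row)
```

A single pass with a running sum and count per neoplasm name keeps memory use proportional to the number of distinct names rather than the number of records, which is why millions of records can be processed quickly.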

In yesterday's blog, I discussed some of the simple Perl routines used in the SEER-data parsing algorithms.

If you want to do creative data mining, you will need to learn a little computer programming.

For Perl and Ruby programmers, methods and scripts for using SEER and other publicly available biomedical databases are described in detail in my prior books:

Perl Programming for Medicine and Biology

Ruby Programming for Medicine and Biology

An overview of the many uses of biomedical information is available in my book,
Biomedical Informatics.

More information on cancer is available in my recently published book, Neoplasms.

© 2008 Jules Berman

key words: neoplasms, cancer, neoplasia, precancer, tumor, tumour, tumors, tumours, neoplasm, carcinogenesis, carcinogens, tumor genetics

As specified in the SEER Data Agreement, the citation for the SEER data is as follows:

"Surveillance, Epidemiology, and End Results (SEER) Program Limited-Use Data (1973-2005), National Cancer Institute, DCCPS, Surveillance Research Program, Cancer Statistics Branch, released April 2008, based on the November 2007 submission."

As with all of my scripts, lists, web sites, and blog entries, the following disclaimer applies. This material is provided by its creator, Jules J. Berman, "as is", without warranty of any kind, expressed or implied, including but not limited to the warranties of merchantability, fitness for a particular purpose and noninfringement. In no event shall the author or copyright holder be liable for any claim, damages or other liability, whether in an action of contract, tort or otherwise, arising from, out of or in connection with the material or the use or other dealings in the material.
