Specified Life: Using SEER Public Use Data: 5

Tuesday, November 18, 2008

SEER is the U.S. National Cancer Institute's Surveillance, Epidemiology and End Results program. It is an amazing resource for information about the cancers that occur in the U.S. One of the products of SEER is the Public Use dataset, which contains de-identified records on over 3.5 million cancers diagnosed between 1973 and 2005.

When you have 3.5 million cancer cases to study, you can draw certain types of inferences that could not possibly be made with the data accumulated at a single medical institution.

The SEER public-use data files are available as a DVD or they can be downloaded from the web.

Information for obtaining these files is available at:

http://seer.cancer.gov/data/

To get these files, you need to fax a signed agreement to the National Library of Medicine and wait for a response (it took just a few hours when I tried a week or two ago).

For this exercise, I downloaded the following file:

SEER_1973_2005_CD2.d08042008.zip (199,212,317 bytes)

Expanded, it produced several directories of files. The raw data records are contained in the following files.


08/04/2008  11:09 AM       143,352,097 BREAST.TXT
08/04/2008  11:09 AM       109,510,121 COLRECT.TXT
08/04/2008  11:09 AM        67,313,323 DIGOTHR.TXT
08/04/2008  11:09 AM        93,497,187 FEMGEN.TXT
08/04/2008  11:09 AM        70,809,823 LYMYLEUK.TXT
08/04/2008  11:09 AM       119,312,494 MALEGEN.TXT
08/04/2008  11:09 AM       127,835,148 OTHER.TXT
08/04/2008  11:09 AM       129,526,418 RESPIR.TXT
08/04/2008  11:09 AM        59,133,585 URINARY.TXT
               9 File(s)    920,290,196 bytes

The SEER files comprise over 920 megabytes of text.

Each record looks something like this. (The following is a fake, truncated record; I didn't want to list an actual record from SEER on my web page. JB)


246000990000001521205001078191409902051986C64908130381303211

An actual record runs to 258 characters.

Suppose we wanted to parse through every record of every file in the SEER data set.

A few lines of Perl will suffice:


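#!/usr/local/bin/perl
# Count the records (lines) in every SEER data file
use strict;
use warnings;
my $seerdir = "c:\\seer";
opendir(my $dh, $seerdir) || die ("Unable to open directory");
# Keep only the data files; readdir also returns "." and ".."
my @files = grep { /\.txt$/i } readdir($dh);
closedir($dh);
chdir($seerdir) || die ("Unable to change directory");
my $totalcount = 0;
foreach my $datafile (@files)
  {
  open (my $text, "<", $datafile) || die ("Unable to open $datafile");
  # The loop ends at end of file, so only real records are counted
  while (my $line = <$text>)
    {
    $totalcount++;
    }
  close $text;
  }
print "\n$totalcount";
exit;

The output of the script is the number "3553244" (the total number of parsed records). It will appear on your computer monitor after about 11 seconds. That's all the time it takes (on a 2.8 GHz CPU) to parse 920 megabytes of text. The count agrees with the file sizes listed above: each record occupies 259 bytes (258 characters plus a newline), and 920,290,196 / 259 = 3,553,244.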
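For readers who prefer Python, here is a rough sketch of the same count. The function name count_records is made up for illustration, and the sketch assumes that the SEER data files are the only .TXT files in the directory.

```python
import os

def count_records(seer_dir):
    """Return the total number of lines in every .txt file in seer_dir."""
    total = 0
    for name in sorted(os.listdir(seer_dir)):
        if not name.lower().endswith(".txt"):
            continue  # skip anything that is not a data file
        path = os.path.join(seer_dir, name)
        # Binary mode: we only count line terminators, so no decoding is needed
        with open(path, "rb") as handle:
            for _ in handle:
                total += 1
    return total
```

Calling count_records(r"c:\seer") would walk the same directory that the Perl script uses.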

Each SEER record is a cancer case, encoded as a string of 258 characters (mostly digits). Each data item occupies fixed byte positions within the record, and the positions are listed in a data dictionary document (available at the SEER web site). Here are the first 14 items in the dictionary.


List of the data dictionary items accounting for the first 46 bytes of a SEER record

    Patient ID number 01-08
    Registry ID 09-18
    Marital Status at DX 19-19
    Race/Ethnicity 20-21
    Spanish/Hispanic Origin 22-22
    NHIA Derived Hispanic Origin 23-23
    Sex 24-24
    Age at diagnosis 25-27
    Year of Birth 28-31
    Birth Place 32-34
    Sequence Number--Central 35-36
    Month of diagnosis 37-38
    Year of diagnosis 39-42
    Primary Site 43-46

These first 14 items are the only items I have used for any of my SEER projects.

When you know the byte locations for the data dictionary entries, you can easily write a short script (I like to use Perl, Ruby, or Python) that can extract and compile data any way you wish.
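As an illustration, here is a minimal Python sketch that turns the 14 dictionary entries listed above into a table and slices a record with it. The field names and the function name parse_record are my own labels, not SEER's. The record is the fake one shown earlier.

```python
# (name, first byte, last byte), 1-based and inclusive, as in the data dictionary
FIELDS = [
    ("patient_id", 1, 8),
    ("registry_id", 9, 18),
    ("marital_status", 19, 19),
    ("race_ethnicity", 20, 21),
    ("spanish_hispanic_origin", 22, 22),
    ("nhia_hispanic_origin", 23, 23),
    ("sex", 24, 24),
    ("age_at_dx", 25, 27),
    ("year_of_birth", 28, 31),
    ("birth_place", 32, 34),
    ("sequence_number", 35, 36),
    ("month_of_dx", 37, 38),
    ("year_of_dx", 39, 42),
    ("primary_site", 43, 46),
]

def parse_record(line):
    """Slice one fixed-width record into a dict of field name -> raw string."""
    # Dictionary byte N is Python index N-1, so bytes first..last are line[first-1:last]
    return {name: line[first - 1:last] for name, first, last in FIELDS}

# The fake, truncated record shown earlier in this post
fake = "246000990000001521205001078191409902051986C64908130381303211"
record = parse_record(fake)
print(record["age_at_dx"], record["year_of_dx"], record["primary_site"])
# prints: 078 1986 C649
```

Keeping the positions in one table means a new field can be added without touching the slicing code.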

Here's a short Perl script that reads the first two records of each SEER public use file, extracts the age at diagnosis and the year of diagnosis from each record, and prints the result to the screen.


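#!/usr/local/bin/perl
# Print age and year of diagnosis for the first two records of each SEER file
use strict;
use warnings;
my $seerdir = "c:\\seer";
opendir(my $dh, $seerdir) || die ("Unable to open directory");
my @files = sort(readdir($dh));
closedir($dh);
chdir($seerdir) || die ("Unable to change directory");
foreach my $datafile (@files)
  {
  next if ($datafile !~ /\.txt$/i);
  open (my $text, "<", $datafile) || die ("Unable to open $datafile");
  my ($line, $age_at_dx, $year_of_dx);
  for (my $i = 0; $i < 2; $i++)
    {
    $line = <$text>;
    last unless defined($line);   # file has fewer than two records
    $age_at_dx = substr($line,24,3);
    $year_of_dx = substr($line,38,4);
    print "$datafile Age $age_at_dx Year $year_of_dx\n";
    }
  close $text;
  }
exit;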

The output looks like:


BREAST.TXT Age 057 Year 1988
BREAST.TXT Age 094 Year 1979
COLRECT.TXT Age 059 Year 1973
COLRECT.TXT Age 069 Year 1989
DIGOTHR.TXT Age 062 Year 1981
DIGOTHR.TXT Age 062 Year 1978
FEMGEN.TXT Age 080 Year 2000
FEMGEN.TXT Age 067 Year 1983
LYMYLEUK.TXT Age 071 Year 1990
LYMYLEUK.TXT Age 078 Year 1990
MALEGEN.TXT Age 088 Year 1984
MALEGEN.TXT Age 082 Year 1979
OTHER.TXT Age 064 Year 1979
OTHER.TXT Age 067 Year 1983
RESPIR.TXT Age 074 Year 1977
RESPIR.TXT Age 091 Year 2004
URINARY.TXT Age 071 Year 1986
URINARY.TXT Age 074 Year 1973

The key line from the Perl script is:


    $age_at_dx = substr($line,24,3);

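This pulls the string consisting of bytes 25, 26, and 27 from the record. Perl's substr counts offsets from zero, so offset 24 is the 25th byte of the record, and the length of 3 takes that byte and the two after it. The data dictionary tells us that bytes 25-27 hold the age at diagnosis.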
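Once a field can be pulled out by position, compiling it is mostly a matter of counting. Here is a small Python sketch that tallies records by year of diagnosis (bytes 39-42). The function name tally_by_year and the check that skips blank or short lines are my own additions.

```python
from collections import Counter

def tally_by_year(lines):
    """Count records per year of diagnosis (data dictionary bytes 39-42)."""
    counts = Counter()
    for line in lines:
        year = line[38:42]  # bytes 39-42 become the zero-based slice 38:42
        if year.isdigit():  # ignore blank or truncated lines
            counts[year] += 1
    return counts
```

Because an open file yields one line at a time, the function can be handed a file object directly, for example tally_by_year(open("BREAST.TXT")), without loading the whole file into memory.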

Two of the most important data items in the record will require a little extra work, if, once extracted, you want to understand their meaning.

These are the morphology codes and the anatomic site codes.

The morphology code occupies bytes 53-57 (for ICDO-3) and bytes 48-52 (for ICDO-2), for each record.

Examples of morphology codes are:


M96783 primary effusion lymphoma
M83703 adrenal cortical carcinoma

You can download a copy of the ICDO-3 codes from SEER.

If you start with a PDF file of the ICDO codes, you will need to cut and paste its contents into a plain ASCII text file, free of formatting characters, to use it in your scripts. I put the list of codes and equivalent terms into a text file that I named "ICDO-3".

I often use the following short snippet to put the codes, and their term equivalents, into a hash. When I need to convert a code into its term, I just look up the code in the hash.


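# Build a hash of ICD-O morphology codes and their terms
my %dictionary;
open (my $icd, "<", "c:\\ftp\\icd03.txt") || die ("Unable to open code file");
while (my $line = <$icd>)
  {
  # Four digits, a slash, one digit, spaces, then the term
  if ($line =~ /([0-9]{4})\/([0-9]) +(.*?) *$/)
    {
    my $code = $1 . $2;
    my $term = lc($3);
    $dictionary{$code} = $term;
    }
  }
close $icd;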

This snippet of code will mean something to you when you look at the data format in SEER's ICDO-3 file.

The same can be done with an ICDO-2 file, and with the anatomic site codes (the primary site item, bytes 43-46).
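A Python sketch of the same idea follows. It assumes each line of the code list looks like "9678/3  Primary effusion lymphoma" (four digits, a slash, one digit, then the term), which is the layout the Perl pattern above expects. The name load_icdo and the two sample lines are illustrative only.

```python
import re

# Assumed line layout: "9678/3  Primary effusion lymphoma"
CODE_LINE = re.compile(r"([0-9]{4})/([0-9])\s+(.*\S)")

def load_icdo(lines):
    """Map five-digit morphology codes (e.g. '96783') to lowercase terms."""
    dictionary = {}
    for line in lines:
        match = CODE_LINE.search(line)
        if match:
            code = match.group(1) + match.group(2)
            dictionary[code] = match.group(3).lower()
    return dictionary

# Two sample lines built from the example codes given in this post
sample = [
    "9678/3  Primary effusion lymphoma\n",
    "8370/3  Adrenal cortical carcinoma\n",
]
terms = load_icdo(sample)
print(terms["96783"])  # prints: primary effusion lymphoma
```

Lines that do not contain a code are simply skipped, so headings and blank lines in the pasted text do no harm.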

Anatomic site codes and terms are available at:

http://www.ncri.ie/data.cgi/html/icdo2sites.shtml

An example of a site code and its term equivalent is:


C649   Kidney NOS*

*NOS means not otherwise specified
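Putting the pieces together, here is a Python sketch that translates the site and morphology codes of a record into terms. The two lookup tables hold only the examples given in this post, so any other code falls through to an "unknown" label. The function name describe is hypothetical.

```python
# Tiny lookup tables built from the examples in this post; real tables would
# be loaded from the full site and morphology lists.
SITES = {"C649": "kidney nos"}
MORPHOLOGIES = {
    "96783": "primary effusion lymphoma",
    "83703": "adrenal cortical carcinoma",
}

def describe(record, sites=SITES, morphologies=MORPHOLOGIES):
    """Return (site term, morphology term) for one fixed-width record."""
    site_code = record[42:46]    # primary site, bytes 43-46
    morph_code = record[52:57]   # ICDO-3 morphology, bytes 53-57
    site = sites.get(site_code, "unknown site " + site_code)
    morph = morphologies.get(morph_code, "unknown morphology " + morph_code)
    return site, morph

# The fake, truncated record shown earlier in this post
fake = "246000990000001521205001078191409902051986C64908130381303211"
print(describe(fake))
# prints: ('kidney nos', 'unknown morphology 81303')
```

With complete tables in place of the two small ones, the second value would be the morphology term instead of the fallback label.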

Everything we want to do with the SEER files involves parsing through the files, line by line (record by record), and then pulling out the data we're interested in.

If you want to do creative data mining, you will need to learn a little computer programming.

For Perl and Ruby programmers, methods and scripts for using SEER and other publicly available biomedical databases are described in detail in my prior books:

Perl Programming for Medicine and Biology

Ruby Programming for Medicine and Biology

An overview of the many uses of biomedical information is available in my book,
Biomedical Informatics.

More information on cancer is available in my recently published book, Neoplasms.

© 2008 Jules Berman

As specified in the SEER Data Agreement, the citation for the SEER data is as follows:

"Surveillance, Epidemiology, and End Results (SEER) Program (www.seer.cancer.gov) Limited-Use Data (1973-2005), National Cancer Institute, DCCPS, Surveillance Research Program, Cancer Statistics Branch, released April 2008, based on the November 2007 submission."

As with all of my scripts, lists, web sites, and blog entries, the following disclaimer applies. This material is provided by its creator, Jules J. Berman, "as is", without warranty of any kind, expressed or implied, including but not limited to the warranties of merchantability, fitness for a particular purpose and noninfringement. In no event shall the author or copyright holder be liable for any claim, damages or other liability, whether in an action of contract, tort or otherwise, arising from, out of or in connection with the material or the use or other dealings in the material.