Sunday, January 2, 2011

Medical research with Google ngrams

This blog post marks the beginning of a series of articles on the general topic of indexing. Eventually, I'll get to standard back-of-book indexing, but I'm going to start with an advanced topic: ngram indexing.

Ngrams are the ordered word sequences in text.

If a text string is:

"Say hello to the cat"

The ngrams are:

say (1-gram or singlet or singleton)
hello (1-gram or singlet or singleton)
to (1-gram or singlet or singleton)
the (1-gram or singlet or singleton)
cat (1-gram or singlet or singleton)
say hello (2-gram or doublet)
hello to (2-gram or doublet)
to the (2-gram or doublet)
the cat (2-gram or doublet)
say hello to (3-gram or triplet)
hello to the (3-gram or triplet)
to the cat (3-gram or triplet)
say hello to the (4-gram or quadruplet)
hello to the cat (4-gram or quadruplet)
say hello to the cat (5-gram or quint or quintuplet)
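The listing above can be reproduced with a few lines of code. What follows is only a minimal sketch, not the method Google uses: the function name ngrams and its max_n parameter are made up for illustration, and the text is simply lowercased and split on whitespace, with no handling of punctuation.

```python
def ngrams(text, max_n=None):
    """Return every ngram of the text, shortest first, in order of occurrence."""
    # Assumption: words are whatever remains after lowercasing and
    # splitting on whitespace; punctuation is not treated specially.
    words = text.lower().split()
    if max_n is None:
        max_n = len(words)
    result = []
    for n in range(1, max_n + 1):
        # Slide a window of n words across the text.
        for start in range(len(words) - n + 1):
            result.append(" ".join(words[start:start + n]))
    return result


for gram in ngrams("Say hello to the cat"):
    print(gram)
```

A string of 5 words yields 5 + 4 + 3 + 2 + 1 = 15 ngrams, matching the list above. Passing max_n=2 would stop after the doublets.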

Google has undertaken a massive effort to enumerate the ngrams collected from the scanned literature dating back to 1500. Moreover, Google has released the ngram files to the public.

The files are available for download at:

We can use Google's own ngram viewer to do our own epidemiologic research.

When we look at the frequency of occurrence of the 2-gram "yellow fever" we get the following Google output.

[Figure: Google ngram viewer graph for "yellow fever"]

We see that the term "yellow fever" (a mosquito-transmitted hepatitis) appeared in the literature beginning about 1800 (the time of its largest peak), with two later peaks (around 1915 and 1945). The dates of the three peaks correspond roughly to the yellow fever outbreak in Philadelphia (1793, with thousands of deaths), the construction of the Panama Canal (finished in 1914, after incurring over 5,000 deaths), and the WWII Pacific outbreaks, countered by mass immunizations with a new and unproven yellow fever vaccine. In this case, a simple review of n-gram "traffic" provides an accurate view of the yellow fever outbreaks.

Let's see the n-gram occurrence graph for "lung cancer".

[Figure: Google ngram viewer graph for "lung cancer"]

There is virtually no mention of lung cancer before the 20th century. Why? Because lung cancer was rare before the introduction of cigarettes. Here is what Wikipedia has to say about cigarette smoking through the twentieth century. "The widespread smoking of cigarettes in the Western world is largely a 20th century phenomenon – at the start of the century the per capita annual consumption in the USA was 54 cigarettes (with less than 0.5% of the population smoking more than 100 cigarettes per year)".

While lung cancer did not occur with great frequency until the twentieth century, gastric cancer has been around for quite a while. In fact, the incidence of stomach cancer dropped through the last half of the twentieth century, presumably due to refrigeration, other safe methods of food preservation, and the general availability of potable water in industrialized countries. Here's the ngram graph for gastric cancer.

[Figure: Google ngram viewer graph for "gastric cancer"]

Notice that the graph has about the same shape whether we search for gastric cancer, stomach cancer, or related synonyms. This tells us that the "traffic" for a medical term and its synonyms can provide similar trends (but with differing amplitudes, reflecting how often each term is used).
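One way to put a number on "same shape, different amplitude" is the Pearson correlation coefficient, which ignores scale and looks only at whether two series rise and fall together. The sketch below is an illustration, not something taken from Google's viewer: the function name pearson is my own, and the two series are made-up numbers, not real ngram frequencies.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation of two equal-length numeric series."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two series of equal length, at least 2 points")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Covariance and variances, left unnormalized because the
    # normalizing factors cancel in the ratio.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant series")
    return cov / sqrt(var_x * var_y)


# Made-up yearly frequencies, for illustration only (not Google data).
term_a = [1.0, 2.0, 4.0, 3.0, 5.0]
term_b = [0.1, 0.2, 0.4, 0.3, 0.5]  # same shape, one tenth the amplitude

print(pearson(term_a, term_b))
```

Two series with the same shape give a coefficient very close to 1, however different their amplitudes. A coefficient near 0 would suggest that the two terms do not track each other.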

Finally, let's look at my favorite subject in tumor biology, the precancers.

[Figure: Google ngram viewer graph for precancer terms]

Precancer terms have occurred with increasing frequency in the twentieth century (perhaps indicating the importance of this class of lesions).

Searching for medical ngrams using Google's ngram viewer has some scientific merit. If we want to get the most out of the ngram files, we will need to do a global analysis of the ngram data related to medical terms. This means we will need to download the ngram data sets and write our own scripts that can analyze the occurrences of every term of interest, all at once, finding correlations of medical significance.
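As a rough idea of what such a script might look like, here is a minimal sketch that tallies yearly counts for a list of terms. It rests on an assumption that should be checked against the documentation that comes with the download: each line is tab-separated, and its first three fields are the ngram, the year, and the match count (any further fields are ignored). The function name yearly_counts and the sample lines are invented for illustration.

```python
from collections import defaultdict


def yearly_counts(lines, terms):
    """Sum match counts per year for each term of interest."""
    # Terms are matched case-insensitively, so "Yellow fever" and
    # "yellow fever" are pooled under one key.
    counts = {t.lower(): defaultdict(int) for t in terms}
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 3:
            continue  # skip blank or malformed lines
        ngram = fields[0].lower()
        if ngram not in counts:
            continue  # not a term we are tracking
        try:
            year = int(fields[1])
            matches = int(fields[2])
        except ValueError:
            continue  # skip lines whose year or count is not a number
        counts[ngram][year] += matches
    return {term: dict(by_year) for term, by_year in counts.items()}


# Invented sample lines in the assumed layout (not real Google data).
sample = [
    "yellow fever\t1900\t12\t10\t8",
    "Yellow fever\t1900\t3\t3\t3",
    "lung cancer\t1950\t40\t30\t20",
    "yellow fever\t1901\t7\t6\t5",
    "not a valid line",
]

print(yearly_counts(sample, ["yellow fever", "lung cancer"]))
```

Because the function accepts any iterable of lines, an open file object for one of the downloaded files can be passed in directly, so a large file is read one line at a time rather than loaded into memory.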

Jump to tomorrow's blog to continue this discussion.
© 2008 Jules J. Berman Ph.D., M.D.
tags: ngrams, doublets, indexing, index, information retrieval, medical informatics, methods