Monday, January 3, 2011

Google ngram medical research 2

In yesterday's blog, we began a series in which we'll discuss using Google's ngram data for medical research. We showed that with Google's ngram viewer, you can enter a word or phrase and find the frequency of occurrences of the phrase in books collected over the past half-millennium. The ngram viewer is intended to show us how particular words and phrases grow or wane in popularity.

There are now many websites that discuss the ngram viewer, but they all seem to be stuck in the realms of culture and literature; nobody seems to be using the ngram viewer for medical research [if this observation is incorrect, please send me a comment].

Words and phrases can tell us a lot about the patterns of disease. With the Google ngram collection, we can answer questions for which there is no other source of informative data [i.e., no historical data, and no existing collections of past observations or measurements]. We saw a few examples in yesterday's blog.

The drawback to Google's ngram viewer is that it produces one-off graphs from a single word or phrase, or a small number of them, and performs only one type of calculation (word/phrase occurrences as a percentage of the total for a particular year).

When you're interested in analyzing a large dataset, you really want to do a global analysis over the data (i.e., analyzing the occurrences of every word or phrase, measured by all possible parameters, all at once). Then, when you start to mine the resulting data, you can look for any kind of trend, among any or all ways of grouping the data.

To understand the problem, let's look at two records in the Google dataset (provided on Google's ngram download page).

circumvallate 1978 313 215 85
circumvallate 1979 183 147 77

The 1-gram "circumvallate" occurs 313 times in the 1978 literature, appearing on 215 pages, in a total of 85 books. In 1979, "circumvallate" occurred 183 times, on 147 pages, in a total of 77 books. Depending on our question, we might be interested in the trends of words or phrases expressed as any of these three parameters (total occurrences, page occurrences, book occurrences).
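As a rough sketch of how such records might be read by a script, the Python below splits each record into the ngram and its four numbers. The names NgramRecord and parse_record are my own, invented for illustration, and the code assumes the fields are separated by whitespace, as displayed above; check the actual delimiter in the downloaded files before relying on it.

```python
from typing import NamedTuple


class NgramRecord(NamedTuple):
    ngram: str
    year: int
    occurrences: int  # total times the ngram appears in that year
    pages: int        # number of pages it appears on
    books: int        # number of books it appears in


def parse_record(line: str) -> NgramRecord:
    """Split one record into its ngram and four integer fields."""
    # Assumption: the last four whitespace-separated fields are numbers;
    # everything before them is the ngram (which may contain spaces).
    parts = line.split()
    ngram = " ".join(parts[:-4])
    year, occurrences, pages, books = (int(p) for p in parts[-4:])
    return NgramRecord(ngram, year, occurrences, pages, books)


records = [parse_record(line) for line in (
    "circumvallate 1978 313 215 85",
    "circumvallate 1979 183 147 77",
)]
for r in records:
    print(r.ngram, r.year, r.occurrences, r.pages, r.books)
```

Keeping all three counts in each record lets a later analysis choose whichever parameter suits the question.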

In the case of a medical term, we might be interested in combining the data for a word with all of its synonyms or plesionyms (near-synonyms). For example, we might want to sum the data for renal carcinoma, kidney cancer, renal ca, kidney ca, kidney carcinoma, carcinoma of the kidney, carcinoma of the kidneys, and so on.
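One way such a sum might be sketched in Python is shown here. The synonym table and the function name sum_by_concept are hypothetical, and the counts in the sample are invented placeholders, not real ngram data.

```python
from collections import defaultdict

# Hypothetical synonym table: each variant maps to one preferred term.
SYNONYMS = {
    "renal carcinoma": "renal carcinoma",
    "kidney cancer": "renal carcinoma",
    "kidney carcinoma": "renal carcinoma",
    "carcinoma of the kidney": "renal carcinoma",
}


def sum_by_concept(records, synonyms):
    """Sum (occurrences, pages, books) per (preferred term, year)."""
    totals = defaultdict(lambda: [0, 0, 0])
    for ngram, year, occurrences, pages, books in records:
        concept = synonyms.get(ngram.lower())
        if concept is None:
            continue  # not a term we are tracking
        bucket = totals[(concept, year)]
        bucket[0] += occurrences
        bucket[1] += pages
        bucket[2] += books
    return {key: tuple(value) for key, value in totals.items()}


# Placeholder counts, invented for illustration only.
sample = [
    ("kidney cancer", 1978, 10, 8, 5),
    ("renal carcinoma", 1978, 20, 15, 9),
    ("kidney cancer", 1979, 12, 9, 6),
    ("circumvallate", 1978, 313, 215, 85),
]
print(sum_by_concept(sample, SYNONYMS))
```

Note that the summed page and book counts are only upper bounds: when two synonyms appear on the same page or in the same book, that page or book is counted once for each synonym.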

Beyond the occurrence of near-synonymous terms, we might want to group classes of terms (e.g., all tumors, diseases spread by insects).

We might want to know the specific year that a term first came into use, or the specific year after which a term ceased to occur in the literature.
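A minimal sketch of that lookup, assuming the records have already been parsed into (ngram, year, occurrences, pages, books) tuples; the function name year_span is my own.

```python
def year_span(records, term):
    """Return (first_year, last_year) in which term occurs, or None."""
    years = [year for ngram, year, occurrences, *_ in records
             if ngram == term and occurrences > 0]
    if not years:
        return None  # the term never appears in these records
    return min(years), max(years)


records = [
    ("circumvallate", 1978, 313, 215, 85),
    ("circumvallate", 1979, 183, 147, 77),
]
print(year_span(records, "circumvallate"))  # (1978, 1979)
```

With only the two sample records, the span is trivially 1978 to 1979; run over a full ngram file, the same function would report the earliest and latest years on record for the term.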

We might want to confine our attention to books that contain specific types of terms (e.g., names of diseases) and to produce a frequency calculation that excludes books that do not contain names of diseases.

We might want to look at the frequency order of terms or groups of terms in a particular publication year.
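As a sketch, ranking the terms of a single publication year might look like this in Python. The function name rank_terms is hypothetical, and the "kidney cancer" counts are invented placeholders rather than real ngram data.

```python
def rank_terms(records, year):
    """Return (ngram, occurrences) pairs for one year, most frequent first."""
    in_year = [(ngram, occurrences)
               for ngram, rec_year, occurrences, *_ in records
               if rec_year == year]
    return sorted(in_year, key=lambda pair: pair[1], reverse=True)


sample = [
    ("kidney cancer", 1978, 10, 8, 5),       # invented placeholder counts
    ("circumvallate", 1978, 313, 215, 85),
    ("circumvallate", 1979, 183, 147, 77),
]
for ngram, occurrences in rank_terms(sample, 1978):
    print(ngram, occurrences)
```

The same approach would work for page or book counts by sorting on a different field.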

We might want to combine ngram data with relevant data included in other datasets.

None of these tasks, nor many more like them, can be accomplished by using Google's public ngram viewer.

The only way we can make any progress with these kinds of questions is to download the ngram data and write our own scripts to analyze the data.

In the next few blogs, I'll provide step-by-step instructions for acquiring, parsing, and analyzing the ngram data.

- © 2011 Jules Berman

In June, 2014, my book, entitled Rare Diseases and Orphan Drugs: Keys to Understanding and Treating the Common Diseases was published by Elsevier. The book builds the argument that our best chance of curing the common diseases will come from studying and curing the rare diseases.



I urge you to read more about my book. There's a generous preview of the book at the Google Books site. If you like the book, please request your librarian to purchase a copy of this book for your library or reading room.

- Jules J. Berman, Ph.D., M.D.
tags: common disease, orphan disease, orphan drugs, rare disease, disease genetics, genetics of complex disease, genetics of common diseases, cryptic disease, ngrams, Google ngram viewer, doublets, indexing, index, information retrieval, medical informatics, methods, translational research, data mining, datamining

Sunday, January 2, 2011

Medical research with Google ngrams

This blog post marks the beginning of a series of articles on the general topic of indexing. Eventually, I'll get to standard back-of-book indexing, but I'm going to start with an advanced topic: ngram indexing.

Ngrams are the ordered word sequences in text.

If a text string is:

"Say hello to the cat"

The ngrams are:

say (1-gram or singlet or singleton)
hello (1-gram or singlet or singleton)
to (1-gram or singlet or singleton)
the (1-gram or singlet or singleton)
cat (1-gram or singlet or singleton)
say hello (2-gram or doublet)
hello to (2-gram or doublet)
to the (2-gram or doublet)
the cat (2-gram or doublet)
say hello to (3-gram or triplet)
hello to the (3-gram or triplet)
to the cat (3-gram or triplet)
say hello to the (4-gram or quadruplet)
hello to the cat (4-gram or quadruplet)
say hello to the cat (5-gram or quint or quintuplet)
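The enumeration above can be sketched in a few lines of Python. The function names are my own, and the text is lowercased to match the listing; this is an illustration of the idea, not a description of how Google built its files.

```python
def ngrams(text, n):
    """Return the ordered n-word sequences in text, lowercased."""
    words = text.lower().split()
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]


def all_ngrams(text):
    """Return every ngram in text, from 1-grams up to the full string."""
    word_count = len(text.split())
    return [gram
            for n in range(1, word_count + 1)
            for gram in ngrams(text, n)]


for gram in all_ngrams("Say hello to the cat"):
    print(gram)
```

For the five-word string, this prints the same 15 ngrams, in the same order, as the list above.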

Google has undertaken a massive effort to enumerate the ngrams collected from the scanned literature dating back to 1500. Moreover, Google has released the ngram files to the public.

The files are available for download at:

http://ngrams.googlelabs.com/datasets

We can use Google's own ngram viewer to do our own epidemiologic research.

When we look at the frequency of occurrence of the 2-gram "yellow fever" we get the following Google output.


[Figure: Google ngram viewer graph for "yellow fever"]


We see that the term "yellow fever" (a mosquito-transmitted hepatitis) appeared in the literature beginning about 1800 (the time of its largest peak), with two subsequent peaks (around 1915 and 1945). The dates of the three peaks correspond roughly to the outbreak of yellow fever in Philadelphia (1793, with thousands of deaths), the construction of the Panama Canal (finished in 1914, after incurring over 5,000 deaths), and WWII Pacific outbreaks, countered by mass immunizations with a new and unproven yellow fever vaccine. In this case, a simple review of n-gram "traffic" provides an accurate view of the yellow fever outbreaks.

Let's see the n-gram occurrence graph for "lung cancer".


[Figure: Google ngram viewer graph for "lung cancer"]


There is virtually no mention of lung cancer before the 20th century. Why? Because lung cancer was rare before the introduction of cigarettes. Here is what Wikipedia has to say about cigarette smoking through the twentieth century. "The widespread smoking of cigarettes in the Western world is largely a 20th century phenomenon – at the start of the century the per capita annual consumption in the USA was 54 cigarettes (with less than 0.5% of the population smoking more than 100 cigarettes per year)".

While lung cancer did not occur with great frequency until the twentieth century, gastric cancer has been around quite a while. In fact, the incidence of stomach cancer has been dropping since the last half of the twentieth century [presumably due to refrigeration, other safe methods of food preservation, and the general availability of potable water in industrialized countries]. Here's the ngram graph for gastric cancer.


[Figure: Google ngram viewer graph for "gastric cancer"]


Notice that the graph has about the same shape whether we search on gastric cancer, stomach cancer, or related synonyms. This tells us that the "traffic" for a medical term and its synonyms can provide similar trends (but with differing amplitudes, reflecting how often each synonym is used).

Finally, let's look at my favorite subject in tumor biology, the precancers.


[Figure: Google ngram viewer graph for precancer terms]


Precancer terms have occurred with increasing frequency in the twentieth century (perhaps indicating the importance of this class of lesions).

Searching for medical ngrams using Google's ngram viewer has some scientific merit. But if we want to get the most out of the ngram files, we will need to do a global analysis of the ngram data related to medical terms. This means we will need to download the ngram data sets and write our own scripts that can analyze the occurrences of every term of interest, all at once, finding correlations of medical significance.

Jump to tomorrow's blog to continue this discussion.
© 2008 Jules J. Berman Ph.D., M.D.
tags: ngrams, doublets, indexing, index, information retrieval, medical informatics, methods