In yesterday's blog, we began a series in which we'll discuss using Google's ngram data for medical research. We showed that with Google's ngram viewer, you can enter a word or phrase and find the frequency of occurrences of the phrase in books collected over the past half-millennium. The ngram viewer is intended to show us how particular words and phrases grow or wane in popularity.
There are now many websites that discuss the ngram viewer, but they all seem to be stuck in the realms of culture and literature; nobody seems to be using the ngram viewer for medical research [if this observation is incorrect, please send me a comment].
Words and phrases can tell us a lot about the patterns of disease. With the Google ngram collection, we can answer questions for which there is no other source of informative data [i.e., no historical data, and no existing collections of past observations or measurements]. We saw a few examples in yesterday's blog.
The drawback to Google's ngram viewer is that it produces one-off graphs from a single word or phrase, or a small number of them, and performs only one type of calculation (word/phrase occurrences as a percentage of the total for a particular year).
When you're interested in analyzing a large dataset, you really want to do a global analysis over the data (i.e., analyzing the occurrences of every word or phrase, measured by all possible parameters, all at once). Then, when you start to mine the resulting data, you can look for any kind of trend, among any or all ways of grouping the data.
To understand the problem, let's look at two records in the Google dataset (provided on Google's ngram download page):
circumvallate 1978 313 215 85
circumvallate 1979 183 147 77
The 1-gram "circumvallate" occurs 313 times in the 1978 literature, appearing on 215 pages and in 85 books. In 1979, "circumvallate" occurred 183 times, on 147 pages, in a total of 77 books. Depending on our question, we might be interested in the trends of words and phrases expressed as any of these three parameters (total occurrences, page occurrences, book occurrences).
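As a minimal sketch of how such records might be read in a script, the following Python parses each line into its five fields. Python is my choice for illustration, the names `NgramRecord` and `parse_record` are hypothetical, and the code assumes the fields are separated by tabs or spaces, with the last four fields always numeric.

```python
from typing import NamedTuple


class NgramRecord(NamedTuple):
    ngram: str
    year: int
    occurrences: int  # total occurrences in that year
    pages: int        # number of pages on which the ngram appears
    books: int        # number of books in which the ngram appears


def parse_record(line: str) -> NgramRecord:
    # Split from the right, so that a multi-word ngram stays in one piece.
    ngram, year, occurrences, pages, books = line.strip().rsplit(None, 4)
    return NgramRecord(ngram, int(year), int(occurrences), int(pages), int(books))


# The two records shown above.
sample = [
    "circumvallate 1978 313 215 85",
    "circumvallate 1979 183 147 77",
]
records = [parse_record(line) for line in sample]
for r in records:
    print(r.ngram, r.year, r.occurrences, r.pages, r.books)
```

Splitting from the right matters once we move past 1-grams, because a phrase such as "kidney cancer" contains a space of its own.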
In the case of a medical term, we might be interested in combining the data for a word with all of its synonyms or plesionyms (near-synonyms). For example, we might want to sum the data for renal carcinoma, kidney cancer, renal ca, kidney ca, kidney carcinoma, carcinoma of the kidney, carcinoma of the kidneys, and so on.
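Here is a rough sketch of how the synonym sum might look in Python. The function name `sum_by_year` is hypothetical, the synonym set is just the list from the paragraph above, and the counts in `placeholder_lines` are invented for illustration (they are not real ngram values), apart from the "circumvallate" record quoted earlier.

```python
from collections import defaultdict

# Synonyms and near-synonyms to be treated as one concept.
SYNONYMS = {
    "renal carcinoma", "kidney cancer", "renal ca", "kidney ca",
    "kidney carcinoma", "carcinoma of the kidney", "carcinoma of the kidneys",
}


def sum_by_year(lines, synonyms):
    """Sum occurrence, page, and book counts per year over a set of terms."""
    totals = defaultdict(lambda: [0, 0, 0])
    for line in lines:
        if not line.strip():
            continue
        ngram, year, occurrences, pages, books = line.strip().rsplit(None, 4)
        if ngram.lower() in synonyms:
            bucket = totals[int(year)]
            bucket[0] += int(occurrences)
            bucket[1] += int(pages)
            bucket[2] += int(books)
    return {year: tuple(counts) for year, counts in sorted(totals.items())}


# Invented placeholder counts, NOT real ngram data.
placeholder_lines = [
    "renal carcinoma 1978 10 8 5",
    "kidney cancer 1978 20 15 9",
    "kidney cancer 1979 7 6 4",
    "circumvallate 1978 313 215 85",
]
combined = sum_by_year(placeholder_lines, SYNONYMS)
print(combined)
```

One caution: the summed occurrence counts are sound, but the summed page and book counts can overstate the truth, because a single page or book that contains two of the synonyms is counted once for each.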
Beyond the occurrence of near-synonymous terms, we might want to group classes of terms (e.g., all tumors, diseases spread by insects).
We might want to know the specific year that a term first came into use, or the specific year after which a term ceased to occur in the literature.
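A simple way to get at this, sketched in Python under the assumption that every line holds an ngram followed by year, occurrence count, page count, and book count, is to collect the years in which a term appears and take the earliest and latest. The function name `first_and_last_year` is hypothetical.

```python
def first_and_last_year(lines, term):
    """Return (first_year, last_year) for term, or None if it never appears."""
    years = []
    for line in lines:
        if not line.strip():
            continue
        ngram, year, _occurrences, _pages, _books = line.strip().rsplit(None, 4)
        if ngram == term:
            years.append(int(year))
    return (min(years), max(years)) if years else None


# The two "circumvallate" records quoted earlier in this post.
sample = [
    "circumvallate 1978 313 215 85",
    "circumvallate 1979 183 147 77",
]
span = first_and_last_year(sample, "circumvallate")
print(span)
```

Run against only the two sample records, this reports just those two years; the answer becomes meaningful when the function is fed the full downloaded file.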
We might want to confine our attention to books that contain specific types of terms (e.g., names of diseases) and to produce a frequency calculation that excludes books that do not contain names of diseases.
We might want to look at the frequency order of terms or groups of terms in a particular publication year.
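A frequency ordering for one publication year might be sketched like this in Python. The function name `rank_terms` and its `top` parameter are hypothetical, and the two kidney-related records below carry invented placeholder counts rather than real ngram values.

```python
def rank_terms(lines, year, top=10):
    """Return up to `top` (ngram, occurrences) pairs for one year, most frequent first."""
    counts = {}
    for line in lines:
        if not line.strip():
            continue
        ngram, rec_year, occurrences, _pages, _books = line.strip().rsplit(None, 4)
        if int(rec_year) == year:
            counts[ngram] = counts.get(ngram, 0) + int(occurrences)
    return sorted(counts.items(), key=lambda item: item[1], reverse=True)[:top]


lines = [
    "circumvallate 1978 313 215 85",   # record quoted earlier in this post
    "circumvallate 1979 183 147 77",   # record quoted earlier in this post
    "kidney cancer 1978 20 15 9",      # invented placeholder, NOT real data
    "renal carcinoma 1978 10 8 5",     # invented placeholder, NOT real data
]
ranking = rank_terms(lines, 1978)
for ngram, occurrences in ranking:
    print(occurrences, ngram)
```

The same function could rank by pages or books instead, by picking a different field from each record.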
We might want to combine ngram data with relevant data included in other datasets.
None of these analyses, nor many others like them, can be done with Google's public ngram viewer.
The only way we can make any progress with these kinds of questions is to download the ngram data and write our own scripts to analyze the data.
In the next few blogs, I'll provide step-by-step instructions for acquiring, parsing, and analyzing the ngram data.
- © 2011 Jules Berman
- Jules J. Berman, Ph.D., M.D.
tags: common disease, orphan disease, orphan drugs, rare disease, disease genetics, genetics of complex disease, genetics of common diseases, cryptic disease, ngrams, Google ngram viewer, doublets, indexing, index, information retrieval, medical informatics, methods, translational research, data mining, datamining