Tuesday, January 15, 2008

Easy method for building public domain medical text corpus

In the last few posts, I used a public domain corpus of medical citations to demonstrate an automatic scrubber (a medical text deidentifier).

It is remarkably easy to create a large public domain text corpus for almost any medical specialty. All you need to do is run a PubMed search and send the search results to "File." If the search collects 50,000 citations, all of them will be written to a file (that you name) on your own hard drive.

The PubMed search site is:
http://www.ncbi.nlm.nih.gov/pubmed/?

Titles and names are, according to the U.S. Copyright Office, excluded from copyright. PubMed citations, which consist of titles, names, and some annotation data (volume, pages, date), can be used freely. The same cannot be said for abstracts, which, as far as I can tell, may be copyrighted. The default "Summary" display (shown in image) produces a list of citations (without abstracts) when downloaded as a file.
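Once the file is on disk, it can be broken into individual citations for use as a corpus. Here is a minimal Python sketch of one way to do that. It assumes that the downloaded file separates citations with blank lines; I have not confirmed that layout for every PubMed display format, so check your own file first. The function name split_citations and the sample citations are made up for illustration.

```python
import re

def split_citations(text):
    """Split downloaded citation text into one string per citation.

    Assumption: citations are separated by one or more blank lines.
    """
    records = re.split(r"\n\s*\n", text.strip())
    # Collapse internal line breaks so each citation becomes a single line.
    return [" ".join(record.split()) for record in records if record.strip()]

# Hypothetical sample text, standing in for a downloaded search file.
sample = """1: Smith J, Doe A.
Hypothetical title one.
J Example. 2007;1(1):1-2.

2: Roe B.
Hypothetical title two.
J Example. 2007;1(2):3-4.
"""

citations = split_citations(sample)
for citation in citations:
    print(citation)
```

To run this on a real download, read the file you named into a string (for example, with open(filename, encoding="utf-8").read()) and pass it to split_citations.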

-Jules Berman

