Saturday, April 5, 2014

Armchair Science: No Experiments, Just Deduction

On Thursday, Amazon published Armchair Science: No Experiments, Just Deduction as a Kindle book.

The book develops the premise that science is not a collection of facts; science is what we can induce from facts.  By observing the night sky, without the aid of telescopes,  we can deduce that the universe is expanding, that the universe is not infinitely old, and why black holes exist.  Without resorting to experimentation or mathematical analysis, we can deduce that gravity is a curvature in space-time, that the particles that compose light have no mass, that there is a theoretical limit to the number of different elements in the universe, and that the earth is billions of years old.   Likewise, simple observations on animals tell us much about the migration of continents, the evolutionary relationships among classes of animals, why the nuclei of cells contain our genetic material, why certain animals are long-lived, why the gestation period of humans is 9 months, and why some diseases are rare and other diseases are common.

In Armchair Science, the reader is confronted with 129 scientific mysteries, in cosmology, particle physics, chemistry, biology, and medicine.  Beginning with simple observations, step-by-step analyses guide the reader toward solutions that are sometimes startling, and always entertaining.

I hope that the readers of this blog will visit the book at its Amazon site and read the "look inside" pages.

Armchair Science is written for general readers who are curious about science, and who want to sharpen their deductive skills.

Book details provided by the Amazon site:

File Size: 7868 KB
Print Length: 295 pages
Simultaneous Device Usage: Unlimited
Sold by: Amazon Digital Services, Inc.
Language: English

- Jules Berman, April 5, 2014

Thursday, June 6, 2013

Condensed Principles of Big Data

Last night I re-read yesterday's post (Toward Big Data Immutability), and I realized that there really is no effective way to use this blog to teach anyone the mechanics of Big Data construction and analysis.  My guess is that many readers were confused by the blog, because a single post cannot provide the back-story to the concepts included in the post.

So, basically, I give up.  If you want to learn the fundamentals of Big Data, you'll need to do some reading.  I would recommend my own book, Principles of Big Data: Preparing, Sharing, and Analyzing Complex Information.  Depending on your background and agenda, you might prefer one of the hundreds of other books written for this vibrant field (I won't be offended).

The best I can do is to summarize, with a few principles, the basic theme of my book.

1. You cannot create a good Big Data resource without good identifiers.  A Big Data resource can be usefully envisioned as a system of identifiers to which data is attached.

2. Data must be described with metadata, and the metadata descriptors should be organized under a classification or an ontology.  The latter will drive down the complexity of the system and will permit heterogeneous data to be shared, merged, and queried across systems.

3. Big Data must be immutable.  You can add to Big Data, but you can never alter or delete the contained data.

4. Big Data must be accessible to the public if it is to have any scientific value.  Unless members of the public have a chance to verify, validate, and examine your data, the conclusions drawn from the data have almost no scientific credibility.

5. Data analysis is important, but data re-analysis is much more important.  There are many ways to analyze data, and it's hard to know when your conclusions are correct.  If principles 1 through 4 are followed, the data can be re-examined at a later time.  If you can re-analyze data, then the original analysis is not so critical. Sometimes, a re-analysis that occurs years or decades after the original report, fortified with new data obtained in the interim, can have enormous value and consequence.
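The first three principles can be sketched in a few lines of Python. This is my own illustration, not code from the book; all names and structures here are invented for the example.

```python
import time

# A Big Data resource envisioned as a system of identifiers to which
# data is attached (principle 1).
resource = {}

def add_event(identifier, metadata_tag, value):
    """Append an immutable, time-stamped event to a data object.

    Each event carries a metadata descriptor (principle 2), and events
    are only ever added, never altered or deleted (principle 3).
    """
    event = {
        "timestamp": time.time(),
        "metadata": metadata_tag,  # descriptor drawn from a classification
        "value": value,
    }
    resource.setdefault(identifier, []).append(event)

# Two glucose measurements attached to the same data object: the
# second does not replace the first -- both persist.
add_event("patient:7381", "lab:glucose_mg_dl", 100)
add_event("patient:7381", "lab:glucose_mg_dl", 115)

assert len(resource["patient:7381"]) == 2
```

Because nothing is ever overwritten, any later re-analysis (principle 5) can reconstruct exactly what the resource contained at any earlier time.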

- Jules Berman

key words: identification, identifier system, mutable, mutability, immutability, big data analysis, repeated analysis, Big Data analyses, analytic techniques for Big Data, scientific validity of Big Data, open access Big Data, public access Big Data, Big Data concepts, Jules J. Berman, Ph.D., M.D.

Wednesday, June 5, 2013

Toward Big Data Immutability

Today's blog continues yesterday's discussion of Big Data Immutability.

Big Data managers must do what seems to be impossible; they must learn how to modify data without altering the original content.  The trick is accomplished with identifiers and time-stamps attached to event data (and yes, it's all discussed at greater length in my book, Principles of Big Data: Preparing, Sharing, and Analyzing Complex Information).

In today's blog, let's just focus on the concept of a time-stamp. Temporal events must be given a time-stamp indicating the time that the event occurred, using a standard measurement for time. The time-stamp must be accurate, persistent, and immutable.

Time-stamps are not tamper-proof. In many instances, changing a recorded time residing in a file or data set requires nothing more than viewing the data on your computer screen and substituting one date and time for another.  Dates that are automatically recorded, by your computer system, can also be altered. Operating systems permit users to reset the system date and time.  Because the timing of events can be altered, scrupulous data managers employ a trusted time-stamp protocol by which a time-stamp can be verified.

Here is a description of how a trusted time-stamp protocol might work.  You have just created a message, and you need to document that the message existed on the current date.  You create a one-way hash on the message (a fixed-length sequence of seemingly random alphanumeric characters). You send the one-way hash sequence to your city's newspaper, with instructions to publish the sequence in the classified section of that day's late edition. You're done.  Anyone questioning whether the message really existed on that particular date can perform their own one-way hash on the message and compare the sequence with the sequence that was published in the city newspaper on that date.  The sequences will be identical to each other.
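The hashing step can be demonstrated with Python's standard hashlib module (the message text below is my own example):

```python
import hashlib

message = "The biopsy on patient 7381 is benign."

# One-way hash: a fixed-length, seemingly random sequence that is
# practically impossible to invert back into the original message.
digest = hashlib.sha256(message.encode("utf-8")).hexdigest()
print(digest)  # 64 hexadecimal characters; this is what you would publish

# Anyone holding the original message can recompute the hash and
# compare it with the published sequence.
verification = hashlib.sha256(message.encode("utf-8")).hexdigest()
assert verification == digest

# A single altered word yields an entirely different sequence.
tampered = hashlib.sha256(b"The biopsy on patient 7381 is malignant.").hexdigest()
assert tampered != digest
```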

Today, newspapers are seldom used in trusted time stamp protocols.  Cautious Big Data managers employ trusted time authorities and encrypted time values to create authenticated and verifiable time-stamp data.  It's all done quickly and transparently, and you end up with event data (log-ins, transactions, quantities received, observations, etc.) that are associated with an identifier, a time, and a descriptor (e.g., a tag that explains the data).  When new events occur, they can be added to a data object containing related event data.  The idea behind all this activity is that old data need never be replaced by new data.  Your data object will always contain the information needed to distinguish one event from another, so that you can choose the event data that is appropriate to your query or your analysis.

-Jules Berman

key words: Big Data, mutable, mutability, data persistence, time stamp, time stamping, encrypted time stamp, data object, time-stamping an event

Tuesday, June 4, 2013

Consequences Of Data Mutability

Today's blog, like yesterday's blog, is based on a discussion in Principles of Big Data: Preparing, Sharing, and Analyzing Complex Information. The book's table of contents is shown in an earlier blog.

Here is an example of an immutability problem:  You are a pathologist working in a university hospital that has just installed a new, $600 million information system. On Tuesday, you released a report on a surgical biopsy, indicating that it contained cancer. On Friday morning, you showed the same biopsy to your colleagues, who all agreed that the biopsy was not malignant, and contained a benign condition that simulated malignancy (looked a little like a cancer, but was not).  Your original diagnosis was wrong, and now you must rectify the error.  You return to the computer, and access the prior report, changing the wording of the diagnosis to indicate that the biopsy is benign.  You can do this, because pathologists are granted "edit" access for pathology reports.  Now, everything seems to have been set right.  The report has been corrected, and the final report in the computer is the official diagnosis.

Unknown to you, the patient's doctor read the incorrect report on Wednesday, the day after the incorrect report was issued, and two days before the correct report replaced the incorrect report. Major surgery was scheduled for the following Wednesday (five days after the corrected report was issued).  Most of the patient's liver was removed.  No cancer was found in the excised liver.  Eventually, the surgeon and patient learned that the original report had been altered.  The patient sued the surgeon, the pathologist, and the hospital.

You, the pathologist, argued in court that the computer held one report issued by the pathologist (following the deletion of the earlier, incorrect report) and that report was correct.  Therefore, you said, you made no error.  The patient's lawyer had access to a medical chart in which paper versions of the diagnosis had been kept.  The lawyer produced, for the edification of the jury, two reports from the same pathologist, on the same biopsy: one positive for cancer, the other benign.  The hospital, conceding that they had no credible defense, settled out of court for a very large sum of money. Meanwhile, back in the hospital, a fastidious intern is deleting an erroneous diagnosis, and substituting his improved rendition.

One of the most important features of serious Big Data resources (such as the data collected in hospital information systems) is immutability.  The rule is simple.  Data is immortal and cannot change.  You can add data to the system, but you can never alter data and you can never erase data.  Immutability is counterintuitive to most people, including most data analysts.  If a patient has a glucose level of 100 on Monday, and the same patient has a glucose level of 115 on Tuesday, then it would seem obvious that his glucose level changed.  Not necessarily so.  Monday's glucose level remains at 100.  Until the end of time, Monday's glucose level will always be 100.  On Tuesday, another glucose level was added to the record for the patient.  Nothing that existed prior to Tuesday was changed.
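In code (my own sketch, not an excerpt from the book), the patient's "current" glucose level is just a query over the accumulated records, never an overwrite:

```python
# Append-only record list for one patient; nothing is ever modified.
glucose_events = [
    {"day": "Monday", "timestamp": 1, "glucose": 100},
    {"day": "Tuesday", "timestamp": 2, "glucose": 115},
]

# The "current" level is simply the event with the latest time-stamp.
latest = max(glucose_events, key=lambda e: e["timestamp"])
print(latest["glucose"])  # 115

# Monday's value is still there, unchanged, for any later re-analysis.
monday = [e for e in glucose_events if e["day"] == "Monday"][0]
assert monday["glucose"] == 100
```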

The key to maintaining immutability in Big Data resources is time-stamping.  In the next blog, we will discuss how data objects hold time-stamped events. 

key words: mutability, immutability, time-stamp, time stamp, altered data, data integrity 

Monday, June 3, 2013

Big Data Is Immutable

Today's blog, like yesterday's blog, is based on a discussion in Principles of Big Data: Preparing, Sharing, and Analyzing Complex Information.  The book's table of contents is shown in an earlier blog.

excerpt from book: "Everyone is familiar with the iconic image, from Orwell's 1984, of a totalitarian government that watches its citizens from telescreens. The ominous phrase, "Big Brother is watching you," evokes an important thesis of Orwell's masterpiece; that a totalitarian government can use an expansive surveillance system to crush its critics.  Lest anyone forget, Orwell's book had a second thesis, that was, in my opinion, more insidious and more disturbing than the threat of governmental surveillance.  Orwell was concerned that governments could change the past and the present by inserting, deleting, and otherwise distorting the information available to citizens.  In Orwell's 1984, old reports of military defeats, genocidal atrocities, ineffective policies, mass starvation, and any ideas that might foment unrest among the proletariat, could all be deleted and replaced with propaganda pieces.  Such truth-altering activities were conducted undetected, routinely distorting everyone's perception of reality to suit a totalitarian agenda. Aside from understanding the dangers inherent in a surveillance-centric society, Orwell [foretold] the dangers inherent with mutable Big Data." [i.e., when archived data can be deleted, inserted, or altered].

"One of the purposes of this book is to describe the potential negative consequences of Big Data, if the data is not collected ethically, not prepared thoughtfully, not analyzed openly, and not subjected to constant public review and correction."

In tomorrow's blog, I'll continue a discussion of mutability and immutability as they pertain to the design and maintenance of Big Data resources.

- Jules Berman 

key words: Big Data, Jules J. Berman, Ph.D., M.D., data integrity, data abuse, data revision, dystopia, dystopian society, distortion of reality, big brother mentality

Sunday, June 2, 2013

Big Data Versus Massive Data

This post is based on a topic covered in Principles of Big Data: Preparing, Sharing, and Analyzing Complex Information, by Jules J. Berman.

In yesterday's blog, we discussed the differences between Big Data and small data.  Today, I want to briefly discuss the differences between Big Data and massive data.

Big Data is defined by the three v's: 

1. Volume - large amounts of data.

2. Variety - the data comes in different forms, including traditional databases, images, documents, and complex records.

3. Velocity - the content of the data is constantly changing, through the absorption of complementary data collections, through the introduction of previously archived data or legacy collections, and from streamed data arriving from multiple sources. 

It is important to distinguish Big Data from "lotsa data" or "massive data".  In a Big Data Resource, all three v's must apply.  It is the size, complexity, and restlessness of Big Data resources that account for the methods by which these resources are designed, operated, and analyzed.

The term "massive data" or "lotsa data" is often applied to enormous collections of simple-format records.  Massive datasets are typically equivalent to enormous spreadsheets (2-dimensional tables of columns and rows), mathematically equivalent to an immense matrix. For scientific purposes, it is sometimes necessary to analyze all of the data in a matrix, all at once.  The analysis of enormous matrices is computationally intensive, and may require the resources of a supercomputer.

Big Data resources are not equivalent to a large spreadsheet, and a Big Data resource is seldom analyzed in its totality.  Big Data analysis is a multi-step process whereby data is extracted, filtered, and transformed, with analysis often proceeding in a piecemeal, sometimes recursive, fashion.  

If you read Principles of Big Data: Preparing, Sharing, and Analyzing Complex Information, you will find that the gulf between massive data and Big Data is profound; the two subjects can seldom be discussed productively within the same venue.

- Jules Berman

key words: Big Data, lotsa data, massive data, data analysis, data analyses, large-scale data, Big Science

Saturday, June 1, 2013

Differences between Big Data and Small Data

Big Data is very different from small data.  Here are some of the  important features that distinguish one from the other.

1. Goals

  • small data-Usually designed to answer a specific question or serve a particular goal. 
  • Big Data-Usually designed with a goal in mind, but the goal is flexible and the questions posed are protean. 

2. Location

  • small data-Typically, small data is contained within one institution, often on one computer, sometimes in one file. 
  • Big Data-Typically spread throughout electronic space, typically parceled onto multiple Internet servers, located anywhere on earth. 

3. Data structure and content

  • small data-Ordinarily contains highly structured data. The data domain is restricted to a single discipline or subdiscipline. The data often comes in the form of uniform records in an ordered spreadsheet. 
  • Big Data-Must be capable of absorbing unstructured data (e.g., such as free-text documents, images, motion pictures, sound recordings, physical objects). The subject matter of the resource may cross multiple disciplines, and the individual data objects in the resource may link to data contained in other, seemingly unrelated, Big Data resources. 

4. Data preparation

  • small data-In many cases, the data user prepares her own data, for her own purposes. 
  • Big Data-The data comes from many diverse sources, and it is prepared by many people. People who use the data are seldom the people who have prepared the data. 

5. Longevity

  • small data-When the data project ends, the data is kept for a limited time and then discarded. 
  • Big Data-Big Data projects typically contain data that must be stored in perpetuity. Ideally, data stored in a Big Data resource will be absorbed into another resource when the original resource terminates. Many Big Data projects extend into the future and the past (e.g., legacy data), accruing data prospectively and retrospectively. 

6. Measurements

  • small data-Typically, the data is measured using one experimental protocol, and the data can be represented using one set of standard units (see Glossary item, Protocol). 
  • Big Data -Many different types of data are delivered in many different electronic formats. Measurements, when present, may be obtained by many different protocols. Verifying the quality of Big Data is one of the most difficult tasks for data managers. 

7. Reproducibility

  • small data-Projects are typically repeatable. If there is some question about the quality of the data, reproducibility of the data, or validity of the conclusions drawn from the data, the entire project can be repeated, yielding a new data set. 
  • Big Data-Replication of a Big Data project is seldom feasible. In most instances, all that anyone can hope for is that bad data in a Big Data resource will be found and flagged as such. 

8. Stakes

  • small data-Project costs are limited. Laboratories and institutions can usually recover from the occasional small data failure. 
  • Big Data-Big Data projects can be obscenely expensive. A failed Big Data effort can lead to bankruptcy, institutional collapse, mass firings, and the sudden disintegration of all the data held in the resource. 

9. Introspection

  • small data-Individual data points are identified by their row and column location within a spreadsheet or database table (see Glossary item, Data point). If you know the row and column headers, you can find and specify all of the data points contained within. 
  • Big Data-Unless the Big Data resource is exceptionally well designed, the contents and organization of the resource can be inscrutable, even to the data managers (see Glossary item, Data manager). Complete access to data, information about the data values, and information about the organization of the data is achieved through a technique herein referred to as introspection (see Glossary item, Introspection). 

10. Analysis

  • small data-In most instances, all of the data contained in the data project can be analyzed together, and all at once. 
  • Big Data-With few exceptions, such as those conducted on supercomputers or in parallel on multiple computers, Big Data is ordinarily analyzed in incremental steps (see Glossary items, Parallel computing, MapReduce). The data are extracted, reviewed, reduced, normalized, transformed, visualized, interpreted, and reanalyzed with different methods.

Friday, May 31, 2013

Big Data Book Explained

My Big Data book

 In yesterday's blog, I announced the publication of my new book, Principles of Big Data, 1st Edition: Preparing, Sharing, and Analyzing Complex Information.  Here is a short essay describing some of the features that distinguish this Big Data book from all of the others.

The book describes:
  • How to deal with complex data objects (unstructured text, categorical data, quantitative data, images, etc.), and how to extract small data sets (the kind you're probably familiar with) from Big Data resources.
  • How to create Big Data resources in a legal, ethical and scientifically sensible manner.
  • How to inspect and analyze Big Data. 
  • How to verify and validate the data and the conclusions drawn from the data.
The book expands upon several subjects that are omitted from most Big Data books.
  • Identifiers and deidentification. Most people simply do not understand the importance of creating unique identifiers for data objects. In the book, I build an argument that Big Data resources are basically just a set of identifiers to which data is attached. Without proper identifiers there can be no useful analysis of Big Data. The book goes into some detail explaining how data objects can be identified and de-identified, and how data identifiers are crucial for merging data obtained from heterogeneous data sources. 
  • Metadata, Classification, and Introspection. Classifications drive down the complexity of Big Data, and metadata permits data objects to contain self-descriptive information about the data contained in the object, and the placement of the data object within the classification. The ability of data objects to provide information about themselves is called introspection. It is another property of serious Big Data resources that allows data from different resources to be shared and merged.
  • Immutability. The data within Big Data resources must be immutable. This means that if a data object has a certain value at a certain time, then it will retain that value forever. From the medical field, if I have a glucose level of 85 on Tuesday, and if I have a new measurement taken on Friday, which tells me that my glucose level is 105, then the glucose level of 105 does not replace the earlier level. It merely creates a second value, with its own time-stamp and its own identifier, both belonging to some data object (e.g., the data object that contains the lab tests of Jules Berman). In the book, I emphasize the practical importance of building immutability into Big Data resources, and how this might be achieved.
  • Estimation. I happen to believe that almost every Big Data project should start off with a quick and dirty estimation. Most of the sophisticated analytic methods simply improve upon simple and intuitive data analysis methods. I devote several chapters to describing how data users should approach Big Data projects, how to assess the available data, and how to quickly estimate your results.
  • Failures. It may come as a shock, but most Big Data efforts fail. Money, staff, and expertise cannot compensate for resources that overlook the fundamental properties of Big Data. In this book, I discuss and dissect some of the more spectacular failures, including several from the realm of Big Science.
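The identifier and deidentification ideas above can be sketched in a few lines of Python. The field names, values, and the registry design are my own illustration, not the book's method.

```python
import uuid

# Hypothetical patient record; the fields are invented for this example.
record = {"name": "John Doe", "mrn": "443-55-21", "glucose": 105}

# A registry, held separately under access controls, maps each opaque
# identifier back to the identifying fields it replaced.
registry = {}

def deidentify(rec, identifying_fields=("name", "mrn")):
    """Replace identifying fields with a unique, opaque identifier."""
    new_id = str(uuid.uuid4())
    registry[new_id] = {f: rec[f] for f in identifying_fields}
    clean = {f: v for f, v in rec.items() if f not in identifying_fields}
    clean["id"] = new_id
    return clean

deidentified = deidentify(record)
assert "name" not in deidentified and deidentified["glucose"] == 105
```

The same unique identifier lets data from heterogeneous sources be merged onto one data object without exposing who the object describes.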
A full Table of Contents for Principles of Big Data, 1st Edition: Preparing, Sharing, and Analyzing Complex Information is found at my web site. You can purchase the book from the Elsevier site. At the site, the listed publication date is June 15, but my editor assures me that the warehouse has copies of the book, and that the official publication date has been moved to June 4. If you have any trouble placing the order, you can wait until June 4 and try again.

Amazon has already released its Kindle version of the book.

Jules Berman

tags: Big Data, Jules J. Berman, mutability, de-identification, troubleshooting Big Data

Thursday, May 30, 2013

Big Data Book Contents

I've taken a hiatus from the Specified Life blog while I wrote my latest book, entitled, Principles of Big Data: Preparing, Sharing, and Analyzing Complex Information.

The Kindle edition is available now, and Amazon has a "look-inside" option on their book page. The print version will be available in a week or two, and Amazon is taking pre-orders. Here is the complete Table of Contents:
Acknowledgments xi
Author Biography xiii
Preface xv
Introduction xix

1. Providing Structure to Unstructured Data
  Background 1
  Machine Translation 2
  Autocoding 4
  Indexing 9
  Term Extraction 11

2. Identification, Deidentification, and Reidentification
  Background 15
  Features of an Identifier System 17
  Registered Unique Object Identifiers 18
  Really Bad Identifier Methods 22
  Embedding Information in an Identifier: Not Recommended 24
  One-Way Hashes 25
  Use Case: Hospital Registration 26
  Deidentification 28
  Data Scrubbing 30
  Reidentification 31
  Lessons Learned 32

3. Ontologies and Semantics
  Background 35
  Classifications, the Simplest of Ontologies 36
  Ontologies, Classes with Multiple Parents 39
  Choosing a Class Model 40
  Introduction to Resource Description Framework Schema 44
  Common Pitfalls in Ontology Development 46

4. Introspection
  Background 49
  Knowledge of Self 50
  eXtensible Markup Language 52
  Introduction to Meaning 54
  Namespaces and the Aggregation of Meaningful Assertions 55
  Resource Description Framework Triples 56
  Reflection 59
  Use Case: Trusted Time Stamp 59
  Summary 60

5. Data Integration and Software Interoperability
  Background 63
  The Committee to Survey Standards 64
  Standard Trajectory 65
  Specifications and Standards 69
  Versioning 71
  Compliance Issues 73
  Interfaces to Big Data Resources 74

6. Immutability and Immortality
  Background 77
  Immutability and Identifiers 78
  Data Objects 80
  Legacy Data 82
  Data Born from Data 83
  Reconciling Identifiers across Institutions 84
  Zero-Knowledge Reconciliation 86
  The Curator’s Burden 87

7. Measurement
  Background 89
  Counting 90
  Gene Counting 93
  Dealing with Negations 93
  Understanding Your Control 95
  Practical Significance of Measurements 96
  Obsessive-Compulsive Disorder: The Mark of a Great Data Manager 97

8. Simple but Powerful Big Data Techniques
  Background 99
  Look at the Data 100
  Data Range 110
  Denominator 112
  Frequency Distributions 115
  Mean and Standard Deviation 119
  Estimation-Only Analyses 122
  Use Case: Watching Data Trends with Google Ngrams 123
  Use Case: Estimating Movie Preferences 126

9. Analysis
  Background 129
  Analytic Tasks 130
  Clustering, Classifying, Recommending, and Modeling 130
  Data Reduction 134
  Normalizing and Adjusting Data 137
  Big Data Software: Speed and Scalability 139
  Find Relationships, Not Similarities 141

10. Special Considerations in Big Data Analysis
  Background 145
  Theory in Search of Data 146
  Data in Search of a Theory 146
  Overfitting 148
  Bigness Bias 148
  Too Much Data 151
  Fixing Data 152
  Data Subsets in Big Data: Neither Additive nor Transitive 153
  Additional Big Data Pitfalls 154

11. Stepwise Approach to Big Data Analysis
  Background 157
  Step 1. A Question Is Formulated 158
  Step 2. Resource Evaluation 158
  Step 3. A Question Is Reformulated 159
  Step 4. Query Output Adequacy 160
  Step 5. Data Description 161
  Step 6. Data Reduction 161
  Step 7. Algorithms Are Selected, If Absolutely Necessary 162
  Step 8. Results Are Reviewed and Conclusions Are Asserted 164
  Step 9. Conclusions Are Examined and Subjected to Validation 164

12. Failure
  Background 167
  Failure Is Common 168
  Failed Standards 169
  Complexity 172
  When Does Complexity Help? 173
  When Redundancy Fails 174
  Save Money; Don’t Protect Harmless Information 176
  After Failure 177
  Use Case: Cancer Biomedical Informatics Grid, a Bridge Too Far 178

13. Legalities
  Background 183
  Responsibility for the Accuracy and Legitimacy of Contained Data 184
  Rights to Create, Use, and Share the Resource 185
  Copyright and Patent Infringements Incurred by Using Standards 187
  Protections for Individuals 188
  Consent 190
  Unconsented Data 194
  Good Policies Are a Good Policy 197
  Use Case: The Havasupai Story 198

14. Societal Issues
  Background 201
  How Big Data Is Perceived 201
  The Necessity of Data Sharing, Even When It Seems Irrelevant 204
  Reducing Costs and Increasing Productivity with Big Data 208
  Public Mistrust 210
  Saving Us from Ourselves 211
  Hubris and Hyperbole 213

15. The Future
  Background 217
  Last Words 226

Glossary 229

References 247

Index 257
In the next few days, I'll be posting short excerpts from the book, along with commentary. Best,
Jules Berman

key words: big data, heterogeneous data, complex datasets, Jules J. Berman, Ph.D., M.D., immutability, introspection, identifiers, de-identification, deidentification, confidentiality, privacy, massive data, lotsa data

Monday, June 27, 2011

Patient identifiers

I have just posted an article on patient identifiers. Here is a short excerpt from the article:
Imagine this scenario. You show up for treatment in the hospital where you were born, and in which you have been seen for various ailments over the past three decades. One of the following events transpires:

1. The hospital has a medical record of someone with your name, but it's not you. After much effort, they find another medical record with your name. Once again, it's the wrong person. After much time and effort, you are told that the hospital has no record for you.

2. The hospital has your medical record. After a few minutes with your doctor, it becomes obvious to both of you that the record is missing a great deal of information, relating to tests and procedures done recently and in the distant past. Nobody can find these missing records. You ask your doctor whether your records may have been inserted into the electronic chart of another patient or of multiple patients. The doctor does not answer your question.

3. The hospital has your medical record, but after a few moments, it becomes obvious that the record includes a variety of tests done on patients other than yourself. Some of the other patients have your name. Others have a different name. Nobody seems to understand how these records got into your chart.

4. You are informed that the hospital has changed its hospital information system, and your old electronic records are no longer available. You are asked to answer a long list of questions concerning your medical history. Your answers will be added to your new medical chart. You can't answer any of the questions with much certainty.

5. You are told that your electronic record was transferred to the hospital information system of a large multi-hospital system. This occurred as a consequence of a complex acquisition and merger. The hospital in which you are seeking care has not yet been deployed within the information structure of the multi-hospital system and has no access to your record. You are assured that the record has not been lost and will be accessible within the decade.

6. You arrive at your hospital to find that it has been demolished and replaced by a shopping center. Your electronic records are gone forever.

These are the kinds of problems that arise when hospitals lack a proper patient identifier system (a common shortcoming). The purpose of the article is to list the features of a patient identifier system, emphasizing the essential role of identifiers in healthcare services and biomedical research.

The full-length article is available at:

© 2011 Jules J. Berman

Thursday, March 31, 2011

Post-Informatics Pathology

For those who have been reading my blogs sequentially, I apologize for my lapse in the google ngram series. I've been preoccupied with other projects, but I hope to pick up where I left off, soon.

In the meantime, the Journal of Pathology Informatics has just published my article on "Post-Informatics Pathology." It is available at:

- Jules Berman

Monday, January 3, 2011

Google ngram medical research 2

In yesterday's blog, we began a series in which we'll discuss using Google's ngram data for medical research. We showed that with Google's ngram viewer, you can enter a word or phrase and find the frequency of occurrences of the phrase in books collected over the past half-millennium. The ngram viewer is intended to show us how particular words and phrases grow or wane in popularity.

There are now many websites that discuss the ngram viewer, but they all seem to be stuck in the realms of culture and literature; nobody seems to be using the ngram viewer for medical research [if this observation is incorrect, please send me a comment].

Words and phrases can tell us a lot about the patterns of disease. With the Google ngram collection, we can answer questions for which there is no other source of informative data [i.e., no historical data, and no existing collections of past observations or measurements]. We saw a few examples in yesterday's blog.

The drawback to Google's ngram viewer is that it produces one-off graphs for a single word or phrase (or a small number of them), and it performs only one type of calculation (word/phrase occurrences as a percentage of the total for a particular year).

When you're interested in analyzing a large dataset, you really want to do a global analysis over the data (i.e., analyzing the occurrences of every word or phrase, measured by all possible parameters, all at once). Then, when you start to mine the resulting data, you can look for any kind of trend, among any or all ways of grouping the data.

To understand the problem, let's look at two records in the Google dataset (provided on Google's ngram download page).

circumvallate 1978 313 215 85
circumvallate 1979 183 147 77

The 1-gram "circumvallate" occurs 313 times in the 1978 literature, appearing on 215 pages, in 85 books. In 1979, "circumvallate" occurred 183 times, on 147 pages, in a total of 77 books. Depending on our question, we might be interested in the trends of words/phrases expressed as any of these three parameters (total occurrences, page occurrences, book occurrences).
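Once the data files are downloaded, records like these can be parsed with a few lines of code. Here is a minimal sketch in Python, with the two records above hard-coded in place of a real (tab-delimited) ngram file:

```python
# Sketch: parse Google 1-gram records of the form
#   word <TAB> year <TAB> match_count <TAB> page_count <TAB> volume_count
# The two records below are the "circumvallate" examples from the text,
# hard-coded here in place of a downloaded ngram file.

records = (
    "circumvallate\t1978\t313\t215\t85\n"
    "circumvallate\t1979\t183\t147\t77"
)

counts_by_year = {}
for line in records.splitlines():
    word, year, matches, pages, books = line.split("\t")
    counts_by_year[int(year)] = {
        "matches": int(matches),  # total occurrences
        "pages": int(pages),      # distinct pages
        "books": int(books),      # distinct books
    }

print(counts_by_year[1978]["matches"])  # 313
print(counts_by_year[1979]["books"])    # 77
```

With the records loaded into a dictionary like this, any of the three parameters can be trended across years.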

In the case of a medical term, we might be interested in combining the data for a word with all of its synonyms or plesionyms (near-synonyms). For example, we might want to sum the data for renal carcinoma, kidney cancer, renal ca, kidney ca, kidney carcinoma, carcinoma of the kidney, carcinoma of the kidneys, and so on.
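Summing a term with its synonyms is then straightforward. A minimal sketch, with hand-made yearly counts standing in for the real ngram data:

```python
# Sketch: combine yearly counts for a term and its synonyms.
# The counts below are invented, for illustration only; real counts
# would be tallied from the downloaded ngram files.

synonym_counts = {
    "renal carcinoma": {1978: 40, 1979: 55},
    "kidney cancer":   {1978: 25, 1979: 30},
}

combined = {}
for term, counts in synonym_counts.items():
    for year, n in counts.items():
        combined[year] = combined.get(year, 0) + n

print(combined)  # {1978: 65, 1979: 85}
```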

Beyond the occurrence of near-synonymous terms, we might want to group classes of terms (e.g., all tumors, diseases spread by insects).

We might want to know the specific year that a term first came into use, or the specific year after which a term ceased to occur in the literature.

We might want to confine our attention to books that contain specific types of terms (e.g., names of diseases) and to produce a frequency calculation that excludes books that do not contain names of diseases.

We might want to look at the frequency order of terms or groups of terms in a particular publication year.

We might want to combine ngram data with relevant data included in other datasets.

None of these analyses, nor many others like them, can be accomplished with Google's public ngram viewer.

The only way we can make any progress with these kinds of questions is to download the ngram data and write our own scripts to analyze the data.

In the next few blogs, I'll provide step-by-step instructions for acquiring, parsing, and analyzing the ngram data.

- © 2011 Jules Berman

key words: ngrams, Google ngram viewer, doublets, indexing, index, information retrieval, medical informatics, methods, translational research, data mining, datamining

Sunday, January 2, 2011

Medical research with google ngrams

This blog post marks the beginning of a series of articles on the general topic of indexing. Eventually, I'll get to standard back-of-book indexing, but I'm going to start with an advanced topic: ngram indexing.

Ngrams are the ordered, contiguous word sequences in a text.

If a text string is:

"Say hello to the cat"

The ngrams are:

say (1-gram or singlet or singleton)
hello (1-gram or singlet or singleton)
to (1-gram or singlet or singleton)
the (1-gram or singlet or singleton)
cat (1-gram or singlet or singleton)
say hello (2-gram or doublet)
hello to (2-gram or doublet)
to the (2-gram or doublet)
the cat (2-gram or doublet)
say hello to (3-gram or triplet)
hello to the (3-gram or triplet)
to the cat (3-gram or triplet)
say hello to the (4-gram or quadruplet)
hello to the cat (4-gram or quadruplet)
say hello to the cat (5-gram or quint or quintuplet)
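The listing above can be generated mechanically; a minimal Python sketch:

```python
# Sketch: enumerate every ngram of a phrase, from 1-grams up to the
# full-length ngram, reproducing the "Say hello to the cat" listing.

def ngrams(text):
    words = text.lower().split()
    result = []
    for n in range(1, len(words) + 1):       # ngram length: 1, 2, ...
        for i in range(len(words) - n + 1):  # starting position
            result.append(" ".join(words[i:i + n]))
    return result

grams = ngrams("Say hello to the cat")
print(len(grams))  # 15: five 1-grams, four 2-grams, ... one 5-gram
print(grams[0])    # say
print(grams[-1])   # say hello to the cat
```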

Google has undertaken a massive effort to enumerate the ngrams collected from the scanned literature dating back to 1500. Moreover, Google has released the ngram files to the public.

The files are available for download at:

We can use Google's own ngram viewer to do our own epidemiologic research.

When we look at the frequency of occurrence of the 2-gram "yellow fever" we get the following Google output.

We see that the term "yellow fever" (a mosquito-transmitted viral disease marked by hepatitis) appeared in the literature beginning about 1800 (the time of its largest peak), with several subsequent peaks (around 1915 and 1945). The dates of the three peaks correspond roughly to the yellow fever outbreak in Philadelphia (1793, with thousands of deaths), the construction of the Panama Canal (finished in 1914, after incurring over 5,000 deaths), and the WWII Pacific outbreaks, countered by mass immunizations with a new and unproven yellow fever vaccine. In this case, a simple review of ngram "traffic" provides an accurate view of the yellow fever outbreaks.

Let's see the ngram occurrence graph for "lung cancer".

There is virtually no mention of lung cancer before the 20th century. Why? Because lung cancer was rare before the introduction of cigarettes. Here is what Wikipedia has to say about cigarette smoking through the twentieth century. "The widespread smoking of cigarettes in the Western world is largely a 20th century phenomenon – at the start of the century the per capita annual consumption in the USA was 54 cigarettes (with less than 0.5% of the population smoking more than 100 cigarettes per year)".

While lung cancer did not occur in great frequency until the twentieth century, gastric cancer has been around quite a while. In fact, the incidence of stomach cancer has been dropping in the last half of the twentieth century [presumably due to refrigeration, other safe methods of food preservation, and the general availability of potable water in industrialized countries]. Here's the ngram graph for gastric cancer.

Notice that the graph has about the same shape whether we search for gastric cancer, stomach cancer, or related synonyms. This tells us that the "traffic" for a medical term and for its synonyms can provide similar trends (but with differing amplitudes, reflecting differences in usage).

Finally, let's look at my favorite subject in tumor biology, the precancers.

Precancer terms have occurred with increasing frequency in the twentieth century (perhaps indicating the importance of this class of lesions).

Searching for medical ngrams with Google's ngram viewer has some scientific merit. But if we want to get the most out of the ngram files, we will need to do a global analysis of the ngram data related to medical terms. This means that we will need to download the ngram datasets and write our own scripts that can analyze the occurrences of every term of interest, all at once, finding correlations of medical significance.

Jump to tomorrow's blog to continue this discussion.

- © 2011 Jules Berman

key words: ngrams, doublets, indexing, index, information retrieval, medical informatics, methods

Wednesday, October 27, 2010

Extracting names from text file

I'm beginning, for this blog, a series of short utility scripts and essays that relate, in one way or another, to the general subject of indexing and data retrieval.

The first entry is a short Perl script (just 18 command lines) that extracts the names (of people) wherever the names may occur within a provided text file. The output consists of an alphabetized list of non-repeating names. The script is so simple that it can easily be translated into any language that supports regular expressions (regex).
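As an illustration of the regex approach (a Python sketch of the idea, not the Perl script itself; the capitalized-word-pair pattern is a simplifying assumption), such a translation might look like:

```python
import re

# Sketch of the idea (not the Perl script described above): treat any
# run of two capitalized words as a candidate name, then print a
# sorted, non-repeating list. Real name extraction needs a far more
# careful pattern; this only illustrates the regex approach.

text = ("Walter Reed investigated yellow fever. "
        "The report cites Walter Reed and William Gorgas.")

candidates = re.findall(r"\b[A-Z][a-z]+ [A-Z][a-z]+\b", text)
names = sorted(set(candidates))
print(names)  # ['Walter Reed', 'William Gorgas']
```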

The script is available at:

Blog readers who are uninterested in indexing and data retrieval may want to visit my two other blogs,

Machiavelli's Laboratory (scientific ethics taught from the perspective of an unethical scientist)


Neoplasms (essays on tumor biology)

- © 2010 Jules Berman

key words: indices, indexing, indexes, index, data retrieval, information retrieval, informatics

Thursday, October 14, 2010

Germ cell tumor web page available

The recent blog series on germ cell tumors has been packaged into a single web page available at:

- © 2010 Jules Berman

key words: carcinogenesis, neoplasia, neoplasms, tumor development, tumour development, tumor biology, tumour biology