Last night I re-read yesterday's post (Toward Big Data Immutability), and I realized that there really is no effective way to use this blog to teach anyone the mechanics of Big Data construction and analysis. My guess is that many readers were confused by the blog, because a single post cannot provide the back-story to the concepts included in the post.
So, basically, I give up. If you want to learn the fundamentals of Big Data, you'll need to do some reading. I would recommend my own book, Principles of Big Data: Preparing, Sharing, and Analyzing Complex Information. Depending on your background and agenda, you might prefer one of the hundreds of other books written for this vibrant field (I won't be offended).
The best I can do is to summarize, with a few principles, the basic theme of my book.
1. You cannot create a good Big Data resource without good identifiers. A Big Data resource can be usefully envisioned as a system of identifiers to which data is attached.
2. Data must be described with metadata, and the metadata descriptors should be organized under a classification or an ontology. The latter will drive down the complexity of the system and will permit heterogeneous data to be shared, merged, and queried across systems.
3. Big Data must be immutable. You can add to Big Data, but you can never alter or delete the contained data.
4. Big Data must be accessible to the public if it is to have any scientific value. Unless members of the public have a chance to verify, validate, and examine your data, the conclusions drawn from the data have almost no scientific credibility.
5. Data analysis is important, but data re-analysis is much more important. There are many ways to analyze data, and it's hard to know when your conclusions are correct. If principles 1 through 4 are followed, the data can be re-examined at a later time. If you can re-analyze data, then the original analysis is not so critical. Sometimes, a re-analysis that occurs years or decades after the original report, fortified with new data obtained in the interim, can have enormous value and consequence.
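Principles 1 and 2 can be made concrete with a small sketch. What follows is not taken from the book; it simply assumes that each piece of data is stored as a triple of (identifier, metadata tag, value), and that identifiers are random UUIDs. The function names and the tag names are hypothetical.

```python
import uuid
from collections import defaultdict

# Every assertion in the resource is a triple: (identifier, metadata tag, value).
# The tag names and sample values are hypothetical illustrations.
triples = []

def new_identifier() -> str:
    """Return a unique identifier for a data object (here, a random UUID)."""
    return str(uuid.uuid4())

def attach(identifier: str, tag: str, value) -> None:
    """Attach a described piece of data to an identifier; nothing is overwritten."""
    triples.append((identifier, tag, value))

def describe(identifier: str) -> dict:
    """Collect everything asserted about one identifier, grouped by metadata tag."""
    found = defaultdict(list)
    for ident, tag, value in triples:
        if ident == identifier:
            found[tag].append(value)
    return dict(found)

patient = new_identifier()
attach(patient, "class", "Patient")
attach(patient, "glucose_level", 85)    # contributed by one source
attach(patient, "glucose_level", 105)   # contributed by another source

print(describe(patient))
```

Because both glucose values hang from the same identifier, data arriving from different sources merges without any value displacing another.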
- Jules Berman
key words: identification, identifier system, mutable, mutability, immutability, big data analysis, repeated analysis, Big Data analyses, analytic techniques for Big Data, scientific validity of Big Data, open access Big Data, public access Big Data, Big Data concepts, Jules J. Berman, Ph.D., M.D.
Specified Life
Devoted to the topic of data specification (including data organization, data description, data retrieval and data sharing) in the life sciences and in medicine.
Thursday, June 6, 2013
Wednesday, June 5, 2013
Toward Big Data Immutability
Today's blog continues yesterday's discussion of Big Data Immutability.
Big Data managers must do what seems to be impossible; they must learn how to modify data without altering the original content. The trick is accomplished with identifiers and time-stamps attached to event data (and yes, it's all discussed at greater length in my book, Principles of Big Data: Preparing, Sharing, and Analyzing Complex Information).
In today's blog, let's just focus on the concept of a time-stamp. Temporal events must be given a time-stamp indicating the time that the event occurred, using a standard measurement for time. The time-stamp must be accurate, persistent, and immutable.
Time-stamps are not tamper-proof. In many instances, changing a recorded time residing in a file or data set requires nothing more than viewing the data on your computer screen and substituting one date and time for another. Dates that are automatically recorded, by your computer system, can also be altered. Operating systems permit users to reset the system date and time. Because the timing of events can be altered, scrupulous data managers employ a trusted time-stamp protocol by which a time-stamp can be verified.
Here is a description of how a trusted time-stamp protocol might work. You have just created a message, and you need to document that the message existed on the current date. You create a one-way hash on the message (a fixed-length sequence of seemingly random alphanumeric characters). You send the one-way hash sequence to your city's newspaper, with instructions to publish the sequence in the classified section of that day's late edition. You're done. Anyone questioning whether the message really existed on that particular date can perform their own one-way hash on the message and compare the sequence with the sequence that was published in the city newspaper on that date. If the message is unaltered, the sequences will be identical to each other.
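The hashing and comparison steps of the newspaper protocol can be sketched in a few lines. This sketch assumes SHA-256 as the one-way hash (the protocol described above does not name an algorithm), and the message text and function names are hypothetical.

```python
import hashlib

def one_way_hash(message: str) -> str:
    """Return the SHA-256 digest of the message as a fixed-length hex string."""
    return hashlib.sha256(message.encode("utf-8")).hexdigest()

def verify(message: str, published_hash: str) -> bool:
    """Re-hash the message and compare it with the sequence that was published."""
    return one_way_hash(message) == published_hash

message = "Text of the message that existed on this date."  # hypothetical message
published = one_way_hash(message)  # the sequence sent to the newspaper

print(verify(message, published))                # True: the message is unchanged
print(verify(message + " (edited)", published))  # False: the message was altered
```

Only the hash is published, so the message itself stays private until someone needs to prove that it existed on the publication date.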
Today, newspapers are seldom used in trusted time stamp protocols. Cautious Big Data managers employ trusted time authorities and encrypted time values to create authenticated and verifiable time-stamp data. It's all done quickly and transparently, and you end up with event data (log-ins, transactions, quantities received, observations, etc.) that are associated with an identifier, a time, and a descriptor (e.g., a tag that explains the data). When new events occur, they can be added to a data object containing related event data. The idea behind all this activity is that old data need never be replaced by new data. Your data object will always contain the information needed to distinguish one event from another, so that you can choose the event data that is appropriate to your query or your analysis.
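A minimal sketch of such a data object, assuming a simple in-memory list of events, might look like the following. The class names, the identifier format, and the sample events are hypothetical, and the times here are ordinary clock values rather than values issued by a trusted time authority.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Any, List, Optional

@dataclass(frozen=True)
class Event:
    """One event: an identifier, a time, a descriptor, and the data itself."""
    identifier: str
    time: datetime
    descriptor: str
    value: Any

@dataclass
class DataObject:
    """Holds related events; events are only ever added, never changed."""
    identifier: str
    events: List[Event] = field(default_factory=list)

    def add(self, descriptor: str, value: Any, time: Optional[datetime] = None) -> Event:
        if time is None:
            time = datetime.now(timezone.utc)
        event = Event(f"{self.identifier}/{len(self.events) + 1}", time, descriptor, value)
        self.events.append(event)
        return event

    def select(self, descriptor: str, start=None, end=None) -> List[Event]:
        """Return the events with this descriptor inside an optional time window."""
        return [e for e in self.events
                if e.descriptor == descriptor
                and (start is None or e.time >= start)
                and (end is None or e.time <= end)]

stock = DataObject("obj-001")
stock.add("log-in", "user jb", datetime(2013, 6, 4, 9, 0, tzinfo=timezone.utc))
stock.add("quantity received", 40, datetime(2013, 6, 4, 10, 0, tzinfo=timezone.utc))
stock.add("quantity received", 25, datetime(2013, 6, 5, 10, 0, tzinfo=timezone.utc))

received = [e.value for e in stock.select("quantity received")]
print(received)
```

The second delivery does not replace the first; a query chooses whichever events suit it by descriptor and time.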
-Jules Berman
key words: Big Data, mutable, mutability, data persistence, time stamp, time stamping, encrypted time stamp, data object, time-stamping an event
Tuesday, June 4, 2013
Consequences Of Data Mutability
Today's blog, like yesterday's blog, is based on a discussion in Principles of Big Data: Preparing, Sharing, and Analyzing Complex Information. The book's table of contents is shown in an earlier blog.
Here is an example of an immutability problem: You are a pathologist working in a university hospital that has just installed a new, $600 million information system. On Tuesday, you released a report on a surgical biopsy, indicating that it contained cancer. On Friday morning, you showed the same biopsy to your colleagues, who all agreed that the biopsy was not malignant, and contained a benign condition that simulated malignancy (looked a little like a cancer, but was not). Your original diagnosis was wrong, and now you must rectify the error. You return to the computer, and access the prior report, changing the wording of the diagnosis to indicate that the biopsy is benign. You can do this, because pathologists are granted "edit" access for pathology reports. Now, everything seems to have been set right. The report has been corrected, and the final report in the computer is the official diagnosis.
Unknown to you, the patient's doctor read the incorrect report on Wednesday, the day after the incorrect report was issued, and two days before the correct report replaced the incorrect report. Major surgery was scheduled for the following Wednesday (five days after the corrected report was issued). Most of the patient's liver was removed. No cancer was found in the excised liver. Eventually, the surgeon and patient learned that the original report had been altered. The patient sued the surgeon, the pathologist, and the hospital.
You, the pathologist, argued in court that the computer held one report issued by the pathologist (following the deletion of the earlier, incorrect report) and that report was correct. Therefore, you said, you made no error. The patient's lawyer had access to a medical chart in which paper versions of the diagnosis had been kept. The lawyer produced, for the edification of the jury, two reports from the same pathologist, on the same biopsy: one positive for cancer, the other benign. The hospital, conceding that they had no credible defense, settled out of court for a very large quantity of money. Meanwhile, back in the hospital, a fastidious intern is deleting an erroneous diagnosis, and substituting his improved rendition.
One of the most important features of serious Big Data resources (such as the data collected in hospital information systems) is immutability. The rule is simple. Data is immortal and cannot change. You can add data to the system, but you can never alter data and you can never erase data. Immutability is counterintuitive to most people, including most data analysts. If a patient has a glucose level of 100 on Monday, and the same patient has a glucose level of 115 on Tuesday, then it would seem obvious that his glucose level changed. Not necessarily so. Monday's glucose level remains at 100. For the end of time, Monday's glucose level will always be 100. On Tuesday, another glucose level was added to the record for the patient. Nothing that existed prior to Tuesday was changed.
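Applied to the pathology report, the rule means that Friday's correction is added alongside Tuesday's report rather than written over it. The sketch below is only an illustration under that assumption; the function names are hypothetical and the calendar dates are arbitrary stand-ins for the Tuesday, Wednesday, and Friday of the story.

```python
from datetime import datetime

# Append-only list of (time issued, diagnosis) entries for one biopsy.
reports = []

def issue(time: datetime, diagnosis: str) -> None:
    """Add a report; earlier reports are never edited or removed."""
    reports.append((time, diagnosis))

def as_of(time: datetime):
    """Return the latest diagnosis issued at or before the given time, or None."""
    issued = [r for r in reports if r[0] <= time]
    return max(issued, key=lambda r: r[0])[1] if issued else None

issue(datetime(2013, 5, 28, 15, 0), "malignant")  # Tuesday: original report
issue(datetime(2013, 5, 31, 9, 0), "benign")      # Friday: correction, added

print(as_of(datetime(2013, 5, 29, 12, 0)))  # what the patient's doctor read on Wednesday
print(as_of(datetime(2013, 6, 5, 8, 0)))    # what was on file the following Wednesday
```

Both reports remain on record, so anyone can see which diagnosis was on file at any moment and that it was later corrected.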
The key to maintaining immutability in Big Data resources is time-stamping. In the next blog, we will discuss how data objects hold time-stamped events.
key words: mutability, immutability, time-stamp, time stamp, altered data, data integrity
Monday, June 3, 2013
Big Data Is Immutable
Today's blog, like yesterday's blog, is based on a discussion in Principles of Big Data: Preparing, Sharing, and Analyzing Complex Information. The book's table of contents is shown in an earlier blog.
excerpt from book: "Everyone is familiar with the iconic image, from Orwell's 1984, of a totalitarian government that watches its citizens from telescreens. The ominous phrase, "Big Brother is watching you," evokes an important thesis of Orwell's masterpiece; that a totalitarian government can use an expansive surveillance system to crush its critics. Lest anyone forget, Orwell's book had a second thesis, that was, in my opinion, more insidious and more disturbing than the threat of governmental surveillance. Orwell was concerned that governments could change the past and the present by inserting, deleting, and otherwise distorting the information available to citizens. In Orwell's 1984, old reports of military defeats, genocidal atrocities, ineffective policies, mass starvation, and any ideas that might foment unrest among the proletariat, could all be deleted and replaced with propaganda pieces. Such truth-altering activities were conducted undetected, routinely distorting everyone's perception of reality to suit a totalitarian agenda. Aside from understanding the dangers inherent in a surveillance-centric society, Orwell [foretold] the dangers inherent with mutable Big Data." [i.e., when archived data can be deleted, inserted, or altered].
"One of the purposes of this book is to describe the potential negative consequences of Big Data, if the data is not collected ethically, not prepared thoughtfully, not analyzed openly, and not subjected to constant public review and correction."
In tomorrow's blog, I'll continue the discussion of mutability and immutability as they pertain to the design and maintenance of Big Data resources.
- Jules Berman
key words: Big Data, Jules J. Berman, Ph.D., M.D., data integrity, data abuse, data revision, dystopia, dystopian society, distortion of reality, big brother mentality
Sunday, June 2, 2013
Big Data Versus Massive Data
This post is based on a topic covered in Principles of Big Data: Preparing, Sharing, and Analyzing Complex Information, by Jules J Berman.
In yesterday's blog, we discussed the differences between Big Data and small data. Today, I wanted to briefly discuss the differences between Big Data and massive data.
Big Data is defined by the three v's:
1. Volume - large amounts of data.
2. Variety - the data comes in different forms, including traditional databases, images, documents, and complex records.
3. Velocity - the content of the data is constantly changing, through the absorption of complementary data collections, through the introduction of previously archived data or legacy collections, and from streamed data arriving from multiple sources.
It is important to distinguish Big Data from "lotsa data" or "massive data". In a Big Data Resource, all three v's must apply. It is the size, complexity, and restlessness of Big Data resources that account for the methods by which these resources are designed, operated, and analyzed.
The term "massive data" or "lotsa data" is often applied to enormous collections of simple-format records. Massive datasets are typically equivalent to enormous spreadsheets (2-dimensional tables of columns and rows), mathematically equivalent to an immense matrix. For scientific purposes, it is sometimes necessary to analyze all of the data in a matrix, all at once. The analysis of enormous matrices is computationally intensive, and may require the resources of a supercomputer.
Big Data resources are not equivalent to a large spreadsheet, and a Big Data resource is seldom analyzed in its totality. Big Data analysis is a multi-step process whereby data is extracted, filtered, and transformed, with analysis often proceeding in a piecemeal, sometimes recursive, fashion.
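The extract, filter, and transform steps can be pictured as a chain of small stages, each consuming only what the previous stage passes along. This is a toy sketch, not a description of any particular Big Data system; the record layout, field names, and sample records are hypothetical.

```python
def extract(records, wanted_type):
    """Pull out only the records of the type relevant to the question."""
    for record in records:
        if record.get("type") == wanted_type:
            yield record

def filter_valid(records):
    """Drop records whose value is missing or not numeric."""
    for record in records:
        if isinstance(record.get("value"), (int, float)):
            yield record

def transform(records):
    """Reduce each record to the plain number needed for analysis."""
    for record in records:
        yield float(record["value"])

# A tiny stand-in for a heterogeneous resource.
resource = [
    {"type": "glucose", "value": 100},
    {"type": "image", "value": "scan_001"},
    {"type": "glucose", "value": None},
    {"type": "glucose", "value": 115},
]

values = list(transform(filter_valid(extract(resource, "glucose"))))
print(values)
```

Each stage is a generator, so records flow through one at a time and the resource is never loaded or analyzed in its totality.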
If you read Principles of Big Data: Preparing, Sharing, and Analyzing Complex Information, you will find that the gulf between massive data and Big Data is profound; the two subjects can seldom be discussed productively within the same venue.
- Jules Berman
key words: Big Data, lotsa data, massive data, data analysis, data analyses, large-scale data, Big Science
Saturday, June 1, 2013
Differences between Big Data and Small Data
Excerpt from Principles of Big Data: Preparing, Sharing, and Analyzing Complex Information, by Jules J Berman (see yesterday's blog).
1. Goals
- small data-Usually designed to answer a specific question or serve a particular goal.
- Big Data-Usually designed with a goal in mind, but the goal is flexible and the questions posed are protean.
2. Location
- small data-Typically, small data is contained within one institution, often on one computer, sometimes in one file.
- Big Data-Typically spread throughout electronic space, typically parceled onto multiple Internet servers, located anywhere on earth.
3. Data structure and content
- small data-Ordinarily contains highly structured data. The data domain is restricted to a single discipline or subdiscipline. The data often comes in the form of uniform records in an ordered spreadsheet.
- Big Data-Must be capable of absorbing unstructured data (e.g., free-text documents, images, motion pictures, sound recordings, physical objects). The subject matter of the resource may cross multiple disciplines, and the individual data objects in the resource may link to data contained in other, seemingly unrelated, Big Data resources.
4. Data preparation
- small data-In many cases, the data user prepares her own data, for her own purposes.
- Big Data-The data comes from many diverse sources, and it is prepared by many people. People who use the data are seldom the people who have prepared the data.
5. Longevity
- small data-When the data project ends, the data is kept for a limited time and then discarded.
- Big Data-Big Data projects typically contain data that must be stored in perpetuity. Ideally, data stored in a Big Data resource will be absorbed into another resource when the original resource terminates. Many Big Data projects extend into the future and the past (e.g., legacy data), accruing data prospectively and retrospectively.
6. Measurements
- small data-Typically, the data is measured using one experimental protocol, and the data can be represented using one set of standard units (see Glossary item, Protocol).
- Big Data-Many different types of data are delivered in many different electronic formats. Measurements, when present, may be obtained by many different protocols. Verifying the quality of Big Data is one of the most difficult tasks for data managers.
7. Reproducibility
- small data-Projects are typically repeatable. If there is some question about the quality of the data, reproducibility of the data, or validity of the conclusions drawn from the data, the entire project can be repeated, yielding a new data set.
- Big Data-Replication of a Big Data project is seldom feasible. In most instances, all that anyone can hope for is that bad data in a Big Data resource will be found and flagged as such.
8. Stakes
- small data-Project costs are limited. Laboratories and institutions can usually recover from the occasional small data failure.
- Big Data-Big Data projects can be obscenely expensive. A failed Big Data effort can lead to bankruptcy, institutional collapse, mass firings, and the sudden disintegration of all the data held in the resource.
9. Introspection
- small data-Individual data points are identified by their row and column location within a spreadsheet or database table (see Glossary item, Data point). If you know the row and column headers, you can find and specify all of the data points contained within.
- Big Data-Unless the Big Data resource is exceptionally well designed, the contents and organization of the resource can be inscrutable, even to the data managers (see Glossary item, Data manager). Complete access to data, information about the data values, and information about the organization of the data is achieved through a technique herein referred to as introspection (see Glossary item, Introspection).
10. Analysis
- small data-In most instances, all of the data contained in the data project can be analyzed together, and all at once.
- Big Data-With few exceptions, such as those conducted on supercomputers or in parallel on multiple computers, Big Data is ordinarily analyzed in incremental steps (see Glossary items, Parallel computing, MapReduce). The data are extracted, reviewed, reduced, normalized, transformed, visualized, interpreted, and reanalyzed with different methods.
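The incremental style of analysis in item 10 can be illustrated by computing a mean from partial results, one chunk at a time. The sketch borrows only the general map-then-combine idea; it is not the MapReduce framework itself, and the chunk size and function names are arbitrary choices for illustration.

```python
from functools import reduce

def chunks(values, size):
    """Yield successive slices of the data, as if each arrived separately."""
    for start in range(0, len(values), size):
        yield values[start:start + size]

def map_chunk(chunk):
    """Summarize one chunk as a (count, sum) pair."""
    return (len(chunk), sum(chunk))

def combine(a, b):
    """Merge two partial summaries into one."""
    return (a[0] + b[0], a[1] + b[1])

values = list(range(1, 101))  # stand-in for data too large to handle at once
count, total = reduce(combine, map(map_chunk, chunks(values, 10)), (0, 0))
mean = total / count

print(count, total, mean)  # 100 5050 50.5
```

Since each chunk is reduced to a small summary before anything is combined, the chunks could just as well be processed on different machines or at different times.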
Friday, May 31, 2013
Big Data Book Explained

In yesterday's blog, I announced the publication of my new book, Principles of Big Data, 1st Edition: Preparing, Sharing, and Analyzing Complex Information. Here is a short essay describing some of the features that distinguish this Big Data book from all of the others.
The book describes:
- How to deal with complex data objects (unstructured text, categorical data, quantitative data, images, etc.), and how to extract small data sets (the kind you're probably familiar with) from Big Data resources.
- How to create Big Data resources in a legal, ethical and scientifically sensible manner.
- How to inspect and analyze Big Data.
- How to verify and validate the data and the conclusions drawn from the data.
- Identifiers and deidentification. Most people simply do not understand the importance of creating unique identifiers for data objects. In the book, I build an argument that Big Data resources are basically just a set of identifiers to which data is attached. Without proper identifiers there can be no useful analysis of Big Data. The book goes into some detail explaining how data objects can be identified and de-identified, and how data identifiers are crucial for merging data obtained from heterogeneous data sources.
- Metadata, Classification, and Introspection. Classifications drive down the complexity of Big Data, and metadata permits data objects to contain self-descriptive information about the data contained in the object, and the placement of the data object within the classification. The ability of data objects to provide information about themselves is called introspection. It is another property of serious Big Data resources that allows data from different resources to be shared and merged.
- Immutability. The data within Big Data resources must be immutable. This means that if a data object has a certain value at a certain time, then it will retain that value forever. From the medical field, if I have a glucose level of 85 on Tuesday, and if I have a new measurement taken on Friday, which tells me that my glucose level is 105, then the glucose level of 105 does not replace the earlier level. It merely creates a second value, with its own time-stamp and its own identifier, both belonging to some data object (e.g., the data object that contains the lab tests of Jules Berman). In the book, I emphasize the practical importance of building immutability into Big Data resources, and how this might be achieved.
- Estimation. I happen to believe that almost every Big Data project should start off with a quick and dirty estimation. Most of the sophisticated analytic methods simply improve upon simple and intuitive data analysis methods. I devote several chapters to describing how data users should approach Big Data projects, how to assess the available data, and how to quickly estimate your results.
- Failures. It may come as a shock, but most Big Data efforts fail. Money, staff, and expertise cannot compensate for resources that overlook the fundamental properties of Big Data. In this book, I discuss and dissect some of the more spectacular failures, including several from the realm of Big Science.
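As one illustration of the quick and dirty estimation mentioned in the list above, a mean can be estimated from a random sample before anyone commits to a full analysis. Random sampling is only one of many possible estimation tactics and is not necessarily the approach taken in the book; the population, sample size, and seed below are arbitrary.

```python
import random

def estimate_mean(population, sample_size, seed=0):
    """Estimate the mean from a random sample drawn without replacement."""
    rng = random.Random(seed)  # fixed seed so that the sketch is repeatable
    sample = rng.sample(population, sample_size)
    return sum(sample) / sample_size

population = range(1, 1_000_001)            # stand-in for a large collection of values
estimate = estimate_mean(population, 10_000)
exact = sum(population) / len(population)   # feasible here, often not in practice

print(round(estimate), exact)
```

The estimate uses one percent of the values; comparing it with the exact mean gives a feel for how far a cheap first pass can be trusted.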
Amazon has already released its Kindle version of the book.
Jules Berman
tags: Big Data, Jules J. Berman, mutability, de-identification, troubleshooting Big Data
Thursday, May 30, 2013
Big Data Book Contents
I've taken a hiatus from the Specified Life blog while I wrote my latest book, entitled, Principles of Big Data: Preparing, Sharing, and Analyzing Complex Information.

The Kindle edition is available now, and Amazon has a "look-inside" option on their book page. The print version will be available in a week or two, and Amazon is taking pre-orders. Here is the complete Table of Contents:
Acknowledgments
Author Biography
Preface
Introduction
1. Providing Structure to Unstructured Data
- Background
- Machine Translation
- Autocoding
- Indexing
- Term Extraction
2. Identification, Deidentification, and Reidentification
- Background
- Features of an Identifier System
- Registered Unique Object Identifiers
- Really Bad Identifier Methods
- Embedding Information in an Identifier: Not Recommended
- One-Way Hashes
- Use Case: Hospital Registration
- Deidentification
- Data Scrubbing
- Reidentification
- Lessons Learned
3. Ontologies and Semantics
- Background
- Classifications, the Simplest of Ontologies
- Ontologies, Classes with Multiple Parents
- Choosing a Class Model
- Introduction to Resource Description Framework Schema
- Common Pitfalls in Ontology Development
4. Introspection
- Background
- Knowledge of Self
- eXtensible Markup Language
- Introduction to Meaning
- Namespaces and the Aggregation of Meaningful Assertions
- Resource Description Framework Triples
- Reflection
- Use Case: Trusted Time Stamp
- Summary
5. Data Integration and Software Interoperability
- Background
- The Committee to Survey Standards
- Standard Trajectory
- Specifications and Standards
- Versioning
- Compliance Issues
- Interfaces to Big Data Resources
6. Immutability and Immortality
- Background
- Immutability and Identifiers
- Data Objects
- Legacy Data
- Data Born from Data
- Reconciling Identifiers across Institutions
- Zero-Knowledge Reconciliation
- The Curator’s Burden
7. Measurement
- Background
- Counting
- Gene Counting
- Dealing with Negations
- Understanding Your Control
- Practical Significance of Measurements
- Obsessive-Compulsive Disorder: The Mark of a Great Data Manager
8. Simple but Powerful Big Data Techniques
- Background
- Look at the Data
- Data Range
- Denominator
- Frequency Distributions
- Mean and Standard Deviation
- Estimation-Only Analyses
- Use Case: Watching Data Trends with Google Ngrams
- Use Case: Estimating Movie Preferences
9. Analysis
- Background
- Analytic Tasks
- Clustering, Classifying, Recommending, and Modeling
- Data Reduction
- Normalizing and Adjusting Data
- Big Data Software: Speed and Scalability
- Find Relationships, Not Similarities
10. Special Considerations in Big Data Analysis
- Background
- Theory in Search of Data
- Data in Search of a Theory
- Overfitting
- Bigness Bias
- Too Much Data
- Fixing Data
- Data Subsets in Big Data: Neither Additive nor Transitive
- Additional Big Data Pitfalls
11. Stepwise Approach to Big Data Analysis
- Background
- Step 1. A Question Is Formulated
- Step 2. Resource Evaluation
- Step 3. A Question Is Reformulated
- Step 4. Query Output Adequacy
- Step 5. Data Description
- Step 6. Data Reduction
- Step 7. Algorithms Are Selected, If Absolutely Necessary
- Step 8. Results Are Reviewed and Conclusions Are Asserted
- Step 9. Conclusions Are Examined and Subjected to Validation
12. Failure
- Background
- Failure Is Common
- Failed Standards
- Complexity
- When Does Complexity Help?
- When Redundancy Fails
- Save Money; Don’t Protect Harmless Information
- After Failure
- Use Case: Cancer Biomedical Informatics Grid, a Bridge Too Far
13. Legalities
- Background
- Responsibility for the Accuracy and Legitimacy of Contained Data
- Rights to Create, Use, and Share the Resource
- Copyright and Patent Infringements Incurred by Using Standards
- Protections for Individuals
- Consent
- Unconsented Data
- Good Policies Are a Good Policy
- Use Case: The Havasupai Story
14. Societal Issues
- Background
- How Big Data Is Perceived
- The Necessity of Data Sharing, Even When It Seems Irrelevant
- Reducing Costs and Increasing Productivity with Big Data
- Public Mistrust
- Saving Us from Ourselves
- Hubris and Hyperbole
15. The Future
- Background
- Last Words
Glossary
References
Index
In the next few days, I'll be posting short excerpts from the book, along with commentary. Best,
Jules Berman
key words: big data, heterogeneous data, complex datasets, Jules J. Berman, Ph.D., M.D., immutability, introspection, identifiers, de-identification, deidentification, confidentiality, privacy, massive data, lotsa data
Monday, June 27, 2011
Patient identifiers
I have just posted an article on patient identifiers. Here is a short excerpt from the article:
Imagine this scenario. You show up for treatment in the hospital where you were born, and in which you have been seen for various ailments over the past three decades. One of the following events transpires:
1. The hospital has a medical record of someone with your name, but it's not you. After much effort, they find another medical record with your name. Once again, it's the wrong person. After much time and effort, you are told that the hospital has no record for you.
2. The hospital has your medical record. After a few minutes with your doctor, it becomes obvious to both of you that the record is missing a great deal of information, relating to tests and procedures done recently and in the distant past. Nobody can find these missing records. You ask your doctor whether your records may have been inserted into the electronic chart of another patient or of multiple patients. The doctor does not answer your question.
3. The hospital has your medical record, but after a few moments, it becomes obvious that the record includes a variety of tests done on patients other than yourself. Some of the other patients have your name. Others have a different name. Nobody seems to understand how these records got into your chart.
4. You are informed that the hospital has changed its hospital information system, and your old electronic records are no longer available. You are asked to answer a long list of questions concerning your medical history. Your answers will be added to your new medical chart. You can't answer any of the questions with much certainty.
5. You are told that your electronic record was transferred to the hospital information system of a large multi-hospital system. This occurred as a consequence of a complex acquisition and merger. The hospital in which you are seeking care has not yet been deployed within the information structure of the multi-hospital system and has no access to your record. You are assured that the record has not been lost and will be accessible within the decade.
6. You arrive at your hospital to find that it has been demolished and replaced by a shopping center. Your electronic records are gone forever.
These are the kinds of problems that arise when hospitals lack a proper patient identifier system (a common shortcoming). The purpose of the article is to list the features of a patient identifier system, emphasizing the essential role of identifiers in healthcare services and biomedical research.
The full-length article is available at:
http://www.julesberman.info/book/id_deid.htm
© 2011 Jules J. Berman
Thursday, March 31, 2011
Post-Informatics Pathology
For those who have been reading my blogs sequentially, I apologize for the lapse in the Google ngram series. I've been preoccupied with other projects, but I hope to pick up where I left off soon.
In the meantime, the Journal of Pathology Informatics has just published my article on "Post-Informatics Pathology." It is available at:
http://www.jpathinformatics.org/text.asp?2011/2/1/18/78499
- Jules Berman
Monday, January 3, 2011
Google ngram medical research 2
In yesterday's blog, we began a series in which we'll discuss using Google's ngram data for medical research. We showed that with Google's ngram viewer, you can enter a word or phrase and find the frequency of occurrences of the phrase in books collected over the past half-millennium. The ngram viewer is intended to show us how particular words and phrases grow or wane in popularity.
There are now many websites that discuss the ngram viewer, but they all seem to be stuck in the realms of culture and literature; nobody seems to be using the ngram viewer for medical research [if this observation is incorrect, please send me a comment].
Words and phrases can tell us a lot about the patterns of disease. With the Google ngram collection, we can answer questions for which there is no other source of informative data [i.e., no historical data, and no existing collections of past observations or measurements]. We saw a few examples in yesterday's blog.
The drawback to Google's ngram viewer is that it produces one-off graphs from a single word or phrase, or from a small number of them, and performs only one type of calculation (word/phrase occurrences as a percentage of the total for a particular year).
When you're interested in analyzing a large dataset, you really want to do a global analysis over the data (i.e., analyzing the occurrences of every word or phrase, measured by all possible parameters, all at once). Then, when you start to mine the resulting data, you can look for any kind of trend, among any or all ways of grouping the data.
To understand the problem, let's look at two records in the Google dataset (provided on Google's ngram download page).
circumvallate 1978 313 215 85
circumvallate 1979 183 147 77
The 1-gram "circumvallate" occurs 313 times in the 1978 literature, appearing on 215 pages, and in 85 books. In 1979, circumvallate occurred 183 times, on 147 pages, in a total of 77 books. Depending on our question, we might be interested in the trends of words or phrases expressed as any of these three parameters (total occurrences, page occurrences, book occurrences).
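As a minimal sketch of how such a record might be read by a script, here is a short Python example. The names NgramRecord and parse_record are hypothetical, chosen for illustration, and the field layout (ngram, year, total occurrences, pages, books) is assumed from the two sample records above. Splitting from the right is a design choice that keeps multi-word ngrams intact.

```python
from typing import NamedTuple


class NgramRecord(NamedTuple):
    """One line of ngram data (hypothetical container for illustration)."""
    ngram: str
    year: int
    occurrences: int  # total occurrences in that year
    pages: int        # number of pages on which the ngram appears
    books: int        # number of books in which the ngram appears


def parse_record(line: str) -> NgramRecord:
    # Split from the right: the last four fields are always numeric,
    # so anything to their left (even if it contains spaces) is the ngram.
    ngram, year, occurrences, pages, books = line.rsplit(None, 4)
    return NgramRecord(ngram, int(year), int(occurrences), int(pages), int(books))


# The two sample records shown above.
sample_lines = [
    "circumvallate 1978 313 215 85",
    "circumvallate 1979 183 147 77",
]
records = [parse_record(line) for line in sample_lines]
for r in records:
    print(r.ngram, r.year, r.occurrences, r.pages, r.books)
```

Once each record is a named structure, any of the three parameters can be selected for a trend analysis without re-reading the file.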
In the case of a medical term, we might be interested in combining the data for a word with all of its synonyms or plesionyms (near-synonyms). For example, we might want to sum the data for renal carcinoma, kidney cancer, renal ca, kidney ca, kidney carcinoma, carcinoma of the kidney, carcinoma of the kidneys, and so on.
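A sketch of the synonym-summing idea, in Python, might look like the following. The function name sum_group_by_year is hypothetical, the record layout is assumed to match the sample records shown earlier, and the counts in the sample lines are invented placeholders used only to show the mechanics (they are not real Google data).

```python
from collections import defaultdict

# A small synonym group taken from the list above.
KIDNEY_CANCER_TERMS = {
    "renal carcinoma",
    "kidney cancer",
    "kidney carcinoma",
    "carcinoma of the kidney",
}


def sum_group_by_year(lines, group):
    """Sum (occurrences, pages, books) per year over every term in the group."""
    totals = defaultdict(lambda: [0, 0, 0])
    for line in lines:
        # The last four fields are numeric; the rest is the ngram itself.
        ngram, year, occurrences, pages, books = line.rsplit(None, 4)
        if ngram.lower() in group:
            t = totals[int(year)]
            t[0] += int(occurrences)
            t[1] += int(pages)
            t[2] += int(books)
    return {year: tuple(t) for year, t in sorted(totals.items())}


# Invented counts, for illustration only.
lines = [
    "renal carcinoma 1978 10 8 5",
    "kidney cancer 1978 4 4 3",
    "kidney cancer 1979 6 5 4",
    "lung cancer 1978 50 40 20",
]
totals = sum_group_by_year(lines, KIDNEY_CANCER_TERMS)
print(totals)
```

One caveat: summed page and book counts can overcount, because a single page or book may contain more than one synonym. The summed total occurrences do not have this problem.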
Beyond the occurrence of near-synonymous terms, we might want to group classes of terms (e.g., all tumors, diseases spread by insects).
We might want to know the specific year that a term first came into use, or the specific year after which a term ceased to occur in the literature.
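Finding the first and last year for a term is a simple scan once the data is in hand. The Python sketch below assumes the record layout of the sample records shown earlier; the function name first_and_last_year and the min_occurrences threshold are hypothetical additions (a threshold can help ignore stray scanning errors).

```python
def first_and_last_year(lines, term, min_occurrences=1):
    """Return (first_year, last_year) in which the term occurs, or None."""
    years = []
    for line in lines:
        # The last four fields are numeric; the rest is the ngram itself.
        ngram, year, occurrences, _pages, _books = line.rsplit(None, 4)
        if ngram == term and int(occurrences) >= min_occurrences:
            years.append(int(year))
    if not years:
        return None
    return min(years), max(years)


# The two sample records shown earlier in this post.
lines = [
    "circumvallate 1978 313 215 85",
    "circumvallate 1979 183 147 77",
]
print(first_and_last_year(lines, "circumvallate"))
```

With only two records, this shows the mechanics and nothing more; a real answer requires scanning the full downloaded file for the term.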
We might want to confine our attention to books that contain specific types of terms (e.g., names of diseases) and to produce a frequency calculation that excludes books that do not contain names of diseases.
We might want to look at the frequency order of terms or groups of terms in a particular publication year.
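The frequency order of terms in one publication year can be sketched in Python as follows. The function name rank_terms is hypothetical, the record layout is assumed from the sample records shown earlier, and the "papilla" and "tongue" counts are invented placeholders (only the circumvallate records come from the sample above).

```python
def rank_terms(lines, year, top=10):
    """Return (term, occurrences) pairs for one year, most frequent first."""
    counts = {}
    for line in lines:
        # The last four fields are numeric; the rest is the ngram itself.
        ngram, record_year, occurrences, _pages, _books = line.rsplit(None, 4)
        if int(record_year) == year:
            counts[ngram] = counts.get(ngram, 0) + int(occurrences)
    # Sort by descending count, then alphabetically to break ties.
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return ranked[:top]


lines = [
    "circumvallate 1978 313 215 85",  # from the sample records
    "papilla 1978 900 700 300",       # invented count, illustration only
    "tongue 1978 4000 3000 1200",     # invented count, illustration only
    "circumvallate 1979 183 147 77",  # from the sample records
]
ranking = rank_terms(lines, 1978)
print(ranking)
```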
We might want to combine ngram data with relevant data included in other datasets.
All of these examples, and many more, cannot be accomplished by using Google's public ngram viewer.
The only way we can make any progress with these kinds of questions is to download the ngram data and write our own scripts to analyze the data.
In the next few blogs, I'll provide step-by-step instructions for acquiring, parsing, and analyzing the ngram data.
- © 2011 Jules Berman
key words: ngrams, Google ngram viewer, doublets, indexing, index, information retrieval, medical informatics, methods, translational research, data mining, datamining
Sunday, January 2, 2011
Medical research with google ngrams
This blog post marks the beginning of a series of articles on the general topic of indexing. Eventually, I'll get to standard back-of-book indexing, but I'm going to start with an advanced topic: ngram indexing.
Ngrams are the ordered word sequences in text.
If a text string is:
"Say hello to the cat"
The ngrams are:
say (1-gram or singlet or singleton)
hello (1-gram or singlet or singleton)
to (1-gram or singlet or singleton)
the (1-gram or singlet or singleton)
cat (1-gram or singlet or singleton)
say hello (2-gram or doublet)
hello to (2-gram or doublet)
to the (2-gram or doublet)
the cat (2-gram or doublet)
say hello to (3-gram or triplet)
hello to the (3-gram or triplet)
to the cat (3-gram or triplet)
say hello to the (4-gram or quadruplet)
hello to the cat (4-gram or quadruplet)
say hello to the cat (5-gram or quint or quintuplet)
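The list above can be generated with a few lines of code. Here is a minimal Python sketch; the function name ngrams is hypothetical, and the lowercasing and whitespace splitting are simplifying assumptions (real text would also need punctuation handling).

```python
def ngrams(text, n):
    """Return every ordered sequence of n consecutive words in the text."""
    words = text.lower().split()
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]


text = "Say hello to the cat"
for n in range(1, 6):
    print(n, ngrams(text, n))
```

A string of five words yields 5 + 4 + 3 + 2 + 1 = 15 ngrams, matching the fifteen entries listed above.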
Google has undertaken a massive effort to enumerate the ngrams collected from the scanned literature dating back to 1500. Moreover, Google has released the ngram files to the public.
The files are available for download at:
http://ngrams.googlelabs.com/datasets
We can use Google's own ngram viewer to do our own epidemiologic research.
When we look at the frequency of occurrence of the 2-gram "yellow fever" we get the following Google output.

[Figure: Google ngram viewer graph for "yellow fever"]
We see that the term "yellow fever" (a mosquito-transmitted hepatitis) appeared in the literature beginning about 1800 (the time of its largest peak), with two subsequent peaks (around 1915 and 1945). The dates of the three peaks correspond roughly to the outbreak of yellow fever in Philadelphia (1793, with thousands of deaths), the construction of the Panama Canal (finished in 1914, after incurring over 5,000 deaths), and WWII Pacific outbreaks, countered by mass immunizations with a new and unproven yellow fever vaccine. In this case, a simple review of n-gram "traffic" provides an accurate view of the yellow fever outbreaks.
Let's see the n-gram occurrence graph for "lung cancer".

[Figure: Google ngram viewer graph for "lung cancer"]
There is virtually no mention of lung cancer before the 20th century. Why? Because lung cancer was rare before the introduction of cigarettes. Here is what Wikipedia has to say about cigarette smoking through the twentieth century. "The widespread smoking of cigarettes in the Western world is largely a 20th century phenomenon – at the start of the century the per capita annual consumption in the USA was 54 cigarettes (with less than 0.5% of the population smoking more than 100 cigarettes per year)".
While lung cancer did not occur in great frequency until the twentieth century, gastric cancer has been around quite a while. In fact, the incidence of stomach cancer dropped through the last half of the twentieth century (presumably due to refrigeration, other safe methods of food preservation, and the general availability of potable water in industrialized countries). Here's the ngram graph for gastric cancer.

[Figure: Google ngram viewer graph for gastric cancer and related terms]
Notice that the graph has about the same shape whether we search for gastric cancer, stomach cancer, or related synonyms. This tells us that the "traffic" for a medical term and its synonyms can provide similar trends (but with amplitudes that differ according to how commonly each term is used).
Finally, let's look at my favorite subject in tumor biology, the precancers.

[Figure: Google ngram viewer graph for precancer terms]
Precancer terms have occurred with increasing frequency in the twentieth century (perhaps indicating the importance of this class of lesions).
Searching for medical ngrams using Google's ngram viewer has some scientific merit. If we want to get the most out of the ngram files, though, we will need to do a global analysis of the ngram data related to medical terms. This means we will need to download the ngram data sets and write our own scripts that can analyze the occurrences of every term of interest, all at once, finding correlations of medical significance.
Jump to tomorrow's blog to continue this discussion.
- © 2011 Jules Berman
key words: ngrams, doublets, indexing, index, information retrieval, medical informatics, methods
Thursday, November 4, 2010
Machiavelli's Laboratory: update available
Machiavelli's Laboratory is a free ebook that I published on April 13, 2010. It is a satiric discourse on scientific ethics, from the perspective of an unethical scientist.
I just posted the latest version of the book at:
http://www.julesberman.info/integ/machfree.htm
Updated PDF and MobiPocket formats are also available at the site.
- Jules Berman
Wednesday, October 27, 2010
Extracting names from text file
I'm beginning, for this blog, a series of short utility scripts and essays that relate, in one way or another, to the general subject of indexing and data retrieval.
The first entry is a short Perl script (just 18 command lines) that extracts the names (of people) wherever the names may occur within a provided text file. The output consists of an alphabetized list of non-repeating names. The script is so simple that it can easily be translated into any language that supports regular expressions (regex).
The script is available at:
http://www.julesberman.info/factoids/namesget.htm
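For readers who want a feel for the approach before following the link, here is a rough Python sketch of regex-based name extraction. It is not a translation of the Perl script at the link above; the pattern (two capitalized words, with an optional middle initial) and the function name extract_names are assumptions made for illustration, and a pattern this simple will also catch capitalized phrases that are not names.

```python
import re

# Assumed pattern: Capitalized word, optional middle initial, Capitalized word.
NAME_PATTERN = re.compile(r"\b[A-Z][a-z]+(?:\s+[A-Z]\.)?\s+[A-Z][a-z]+\b")


def extract_names(text):
    """Return an alphabetized list of non-repeating candidate names."""
    return sorted(set(NAME_PATTERN.findall(text)))


sample = (
    "Dr. Jules Berman wrote it. "
    "Later, Jules Berman met Mary K. Smith in the lab."
)
names = extract_names(sample)
for name in names:
    print(name)
```

Putting the matches into a set removes the repeats, and sorting the set produces the alphabetized output described above.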
Blog readers who are uninterested in indexing and data retrieval may want to visit my two other blogs,
Machiavelli's Laboratory (scientific ethics taught from the perspective of an unethical scientist)
and
Neoplasms (essays on tumor biology)
- © 2010 Jules Berman
key words: indices, indexing, indexes, index, data retrieval, information retrieval, informatics
Thursday, October 14, 2010
Germ cell tumor web page available
The recent blog series on germ cell tumors has been packaged into a single web page available at:
http://www.julesberman.info/factoids/germcell.htm
- © 2010 Jules Berman
key words: carcinogenesis, neoplasia, neoplasms, tumor development, tumour development, tumor biology, tumour biology