Thursday, June 6, 2013

Condensed Principles of Big Data

Last night I re-read yesterday's post (Toward Big Data Immutability), and I realized that there really is no effective way to use this blog to teach anyone the mechanics of Big Data construction and analysis.  My guess is that many readers were confused, because a single post cannot provide the back-story for the concepts it contains.

So, basically, I give up.  If you want to learn the fundamentals of Big Data, you'll need to do some reading.  I would recommend my own book, Principles of Big Data: Preparing, Sharing, and Analyzing Complex Information.  Depending on your background and agenda, you might prefer one of the hundreds of other books written for this vibrant field (I won't be offended).

The best I can do is to summarize, with a few principles, the basic theme of my book.

1. You cannot create a good Big Data resource without good identifiers.  A Big Data resource can be usefully envisioned as a system of identifiers to which data is attached.

2. Data must be described with metadata, and the metadata descriptors should be organized under a classification or an ontology.  Organizing the descriptors in this way drives down the complexity of the system and permits heterogeneous data to be shared, merged, and queried across systems.

3. Big Data must be immutable.  You can add to Big Data, but you can never alter or delete the contained data.

4. Big Data must be accessible to the public if it is to have any scientific value.  Unless members of the public have a chance to verify, validate, and examine your data, the conclusions drawn from the data have almost no scientific credibility.

5. Data analysis is important, but data re-analysis is much more important.  There are many ways to analyze data, and it's hard to know when your conclusions are correct.  If principles 1 through 4 are followed, the data can be re-examined at a later time.  If you can re-analyze data, then the original analysis is not so critical. Sometimes, a re-analysis that occurs years or decades after the original report, fortified with new data obtained in the interim, can have enormous value and consequence.
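To make the first three principles a bit more concrete, here is a minimal sketch in Python (my own illustration, not code from the book; the identifier scheme and the tiny descriptor vocabulary are invented for the example): every datum is attached to a unique identifier, described with a term drawn from a controlled vocabulary, and appended rather than overwritten.

```python
import uuid

# A toy controlled vocabulary, standing in for a real classification or ontology.
DESCRIPTORS = {"serum_glucose", "body_weight", "diagnosis"}

# The resource: identifiers mapped to append-only lists of (descriptor, value) pairs.
resource = {}

def new_identifier():
    """Mint a unique, permanent identifier for a new data object."""
    return str(uuid.uuid4())

def attach_datum(identifier, descriptor, value):
    """Attach a described datum to an identifier; nothing is ever altered or deleted."""
    if descriptor not in DESCRIPTORS:
        raise ValueError(f"descriptor is not in the classification: {descriptor}")
    resource.setdefault(identifier, []).append((descriptor, value))

patient = new_identifier()
attach_datum(patient, "body_weight", 70)
attach_datum(patient, "body_weight", 72)   # a later measurement is added, never substituted
print(patient, resource[patient])
```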

- Jules Berman

key words: identification, identifier system, mutable, mutability, immutability, big data analysis, repeated analysis, Big Data analyses, analytic techniques for Big Data, scientific validity of Big Data, open access Big Data, public access Big Data, Big Data concepts, Jules J. Berman, Ph.D., M.D.

Wednesday, June 5, 2013

Toward Big Data Immutability


Today's blog continues yesterday's discussion of Big Data Immutability.

Big Data managers must do what seems to be impossible; they must learn how to modify data without altering the original content.  The trick is accomplished with identifiers and time-stamps attached to event data (and yes, it's all discussed at greater length in my book, Principles of Big Data: Preparing, Sharing, and Analyzing Complex Information).

In today's blog, let's just focus on the concept of a time-stamp. Temporal events must be given a time-stamp indicating the time that the event occurred, using a standard representation of time. The time-stamp must be accurate, persistent, and immutable.
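For illustration (my example, not an excerpt from the book), the Python standard library can produce a time-stamp in a standard, unambiguous form; ISO 8601 expressed in Coordinated Universal Time is one reasonable convention:

```python
from datetime import datetime, timezone
import time

# An ISO 8601 time-stamp in Coordinated Universal Time (UTC), so the recorded
# time does not depend on the local time zone or daylight saving rules.
event_time = datetime.now(timezone.utc).isoformat()
print(event_time)      # e.g., 2013-06-05T14:32:07.123456+00:00

# The same moment expressed as seconds since the Unix epoch, another common convention.
print(time.time())
```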

Time-stamps are not tamper-proof. In many instances, changing a recorded time residing in a file or data set requires nothing more than viewing the data on your computer screen and substituting one date and time for another.  Dates that are automatically recorded by your computer system can also be altered, because operating systems permit users to reset the system date and time.  Because the timing of events can be altered, scrupulous data managers employ a trusted time-stamp protocol by which a time-stamp can be verified.

Here is a description of how a trusted time-stamp protocol might work.  You have just created a message, and you need to document that the message existed on the current date.  You compute a one-way hash of the message (a fixed-length sequence of seemingly random alphanumeric characters). You send the one-way hash sequence to your city's newspaper, with instructions to publish the sequence in the classified section of that day's late edition. You're done.  Anyone questioning whether the message really existed on that particular date can compute their own one-way hash of the message and compare the sequence with the sequence that was published in the city newspaper on that date.  If the message has not changed, the two sequences will be identical.
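Here is a minimal sketch of the hashing step, using Python's standard hashlib module (the message text is invented for illustration; any strong one-way hash function, such as SHA-256, would serve):

```python
import hashlib

message = b"This is the message whose existence on June 5, 2013 I wish to prove."

# Compute a one-way hash of the message. The hash reveals nothing useful about
# the message content, but any change to the message yields a different hash.
digest = hashlib.sha256(message).hexdigest()
print(digest)   # this fixed-length sequence is what you would publish

# Later, anyone holding the original message can recompute the hash and
# compare it with the published sequence.
assert hashlib.sha256(message).hexdigest() == digest
```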

Today, newspapers are seldom used in trusted time-stamp protocols.  Cautious Big Data managers employ trusted time-stamp authorities and encrypted time values to create authenticated and verifiable time-stamp data.  It's all done quickly and transparently, and you end up with event data (log-ins, transactions, quantities received, observations, etc.) that are associated with an identifier, a time, and a descriptor (e.g., a tag that explains the data).  When new events occur, they can be added to a data object containing related event data.  The idea behind all this activity is that old data need never be replaced by new data.  Your data object will always contain the information needed to distinguish one event from another, so that you can choose the event data that is appropriate to your query or your analysis.
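Here is one way such event data might be represented (a hedged sketch of my own; the field names and the sample events are assumptions, not the book's implementation): each event carries an identifier, a time-stamp, and a descriptor, and new events are appended without ever replacing old ones.

```python
from dataclasses import dataclass
from datetime import datetime, timezone

@dataclass(frozen=True)       # frozen: an event, once recorded, cannot be modified
class Event:
    identifier: str           # the data object this event belongs to
    timestamp: str            # ISO 8601, UTC
    descriptor: str           # a tag that explains the data
    value: object

def record_event(log, identifier, descriptor, value):
    """Append a new event to the log; existing events are never altered or removed."""
    event = Event(identifier, datetime.now(timezone.utc).isoformat(), descriptor, value)
    log.append(event)
    return event

log = []
record_event(log, "shipment-001", "quantity_received", 250)
record_event(log, "shipment-001", "quantity_received", 75)   # a later delivery: added, not substituted
for event in log:
    print(event)              # every event remains available to later queries
```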

-Jules Berman

key words: Big Data, mutable, mutability, data persistence, time stamp, time stamping, encrypted time stamp, data object, time-stamping an event, archiving, dystopia, George Orwell, newspeak, persistence, persistent data, saving data, time-stamp


Tuesday, June 4, 2013

Consequences Of Data Mutability

Today's blog, like yesterday's blog, is based on a discussion in Principles of Big Data: Preparing, Sharing, and Analyzing Complex Information. The book's table of contents is shown in an earlier blog.

Here is an example of a mutability problem:  You are a pathologist working in a university hospital that has just installed a new, $600 million information system. On Tuesday, you released a report on a surgical biopsy, indicating that it contained cancer. On Friday morning, you showed the same biopsy to your colleagues, who all agreed that the biopsy was not malignant and contained a benign condition that simulated malignancy (looked a little like a cancer, but was not).  Your original diagnosis was wrong, and now you must rectify the error.  You return to the computer and access the prior report, changing the wording of the diagnosis to indicate that the biopsy is benign.  You can do this, because pathologists are granted "edit" access for pathology reports.  Now, everything seems to have been set right.  The report has been corrected, and the final report in the computer is the official diagnosis.

Unknown to you, the patient's doctor read the incorrect report on Wednesday, the day after the incorrect report was issued, and two days before the correct report replaced the incorrect report. Major surgery was scheduled for the following Wednesday (five days after the corrected report was issued).  Most of the patient's liver was removed.  No cancer was found in the excised liver.  Eventually, the surgeon and patient learned that the original report had been altered.  The patient sued the surgeon, the pathologist, and the hospital.

You, the pathologist, argued in court that the computer held one report issued by the pathologist (following the deletion of the earlier, incorrect report) and that report was correct.  Therefore, you said, you made no error.  The patient's lawyer had access to a medical chart in which paper versions of the diagnosis had been kept.  The lawyer produced, for the edification of the jury, two reports from the same pathologist, on the same biopsy: one positive for cancer, the other benign.  The hospital, conceding that it had no credible defense, settled out of court for a very large sum of money. Meanwhile, back in the hospital, a fastidious intern is deleting an erroneous diagnosis and substituting his improved rendition.

One of the most important features of serious Big Data resources (such as the data collected in hospital information systems) is immutability.  The rule is simple.  Data is immortal and cannot change.  You can add data to the system, but you can never alter data and you can never erase data.  Immutability is counterintuitive to most people, including most data analysts.  If a patient has a glucose level of 100 on Monday, and the same patient has a glucose level of 115 on Tuesday, then it would seem obvious that his glucose level changed.  Not necessarily so.  Monday's glucose level remains at 100.  Until the end of time, Monday's glucose level will always be 100.  On Tuesday, another glucose level was added to the record for the patient.  Nothing that existed prior to Tuesday was changed.
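A small sketch of how an append-only record preserves Monday's value (my own illustration; the dates and the query function are invented for the example):

```python
from datetime import date

# An append-only list of (measurement_date, glucose_level) pairs for one patient.
glucose_events = [
    (date(2013, 6, 3), 100),   # Monday's measurement
    (date(2013, 6, 4), 115),   # Tuesday: a new event is added; Monday's is untouched
]

def glucose_as_of(events, query_date):
    """Return the most recent glucose level recorded on or before query_date."""
    eligible = [(d, level) for d, level in events if d <= query_date]
    return max(eligible)[1] if eligible else None

print(glucose_as_of(glucose_events, date(2013, 6, 3)))   # 100 -- Monday's value never changed
print(glucose_as_of(glucose_events, date(2013, 6, 4)))   # 115
```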

The key to maintaining immutability in Big Data resources is time-stamping.  In the next blog, we will discuss how data objects hold time-stamped events. 


key words: mutability, archiving, dystopia, George Orwell, newspeak, persistence, persistent data, saving data, immutability, time-stamp, time stamp, altered data, data integrity 


Monday, June 3, 2013

Big Data Is Immutable

Today's blog, like yesterday's blog, is based on a discussion in Principles of Big Data: Preparing, Sharing, and Analyzing Complex Information.  The book's table of contents is shown in an earlier blog.

excerpt from book: "Everyone is familiar with the iconic image, from Orwell's 1984, of a totalitarian government that watches its citizens from telescreens. The ominous phrase, "Big Brother is watching you," evokes an important thesis of Orwell's masterpiece: that a totalitarian government can use an expansive surveillance system to crush its critics.  Lest anyone forget, Orwell's book had a second thesis that was, in my opinion, more insidious and more disturbing than the threat of governmental surveillance.  Orwell was concerned that governments could change the past and the present by inserting, deleting, and otherwise distorting the information available to citizens.  In Orwell's 1984, old reports of military defeats, genocidal atrocities, ineffective policies, mass starvation, and any ideas that might foment unrest among the proletariat, could all be deleted and replaced with propaganda pieces.  Such truth-altering activities were conducted undetected, routinely distorting everyone's perception of reality to suit a totalitarian agenda. Aside from understanding the dangers inherent in a surveillance-centric society, Orwell [foretold] the dangers inherent with mutable Big Data." [i.e., archived data that can be deleted, inserted, or altered].

"One of the purposes of this book is to describe the potential negative consequences of Big Data, if the data is not collected ethically, not prepared thoughtfully, not analyzed openly, and not subjected to constant public review and correction."

In tomorrow's blog, I'll continue the discussion of mutability and immutability as they pertain to the design and maintenance of Big Data resources.

- Jules Berman 

key words: Big Data, Jules J. Berman, Ph.D., M.D., data integrity, data abuse, data revision, dystopia, dystopian society, distortion of reality, big brother mentality, archiving, dystopia, George Orwell, newspeak, persistence, persistent data, saving data, time-stamp, immutable, immutability, privacy, confidentiality, Big Brother


Sunday, June 2, 2013

Big Data Versus Massive Data

This post is based on a topic covered in Principles of Big Data: Preparing, Sharing, and Analyzing Complex Information, by Jules J. Berman.

In yesterday's blog, we discussed the differences between Big Data and small data.  Today, I want to briefly discuss the differences between Big Data and massive data.

Big Data is defined by the three v's: 

1. Volume - large amounts of data.

2. Variety - the data comes in different forms, including traditional databases, images, documents, and complex records.

3. Velocity - the content of the data is constantly changing, through the absorption of complementary data collections, through the introduction of previously archived data or legacy collections, and from streamed data arriving from multiple sources. 

It is important to distinguish Big Data from "lotsa data" or "massive data".  In a Big Data resource, all three v's must apply.  It is the size, complexity, and restlessness of Big Data resources that account for the methods by which these resources are designed, operated, and analyzed.

The term "massive data" or "lotsa data" is often applied to enormous collections of simple-format records.  Massive datasets are typically equivalent to enormous spreadsheets (2-dimensional tables of columns and rows), mathematically equivalent to an immense matrix. For scientific purposes, it is sometimes necessary to analyze all of the data in a matrix, all at once.  The analyses of enormous matrices is computationally intensive, and may require the resources of a supercomputer. 

Big Data resources are not equivalent to a large spreadsheet, and a Big Data resource is seldom analyzed in its totality.  Big Data analysis is a multi-step process whereby data is extracted, filtered, and transformed, with analysis often proceeding in a piecemeal, sometimes recursive, fashion.  
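As a rough illustration of the piecemeal approach (my own sketch; the file name, the field layout, and the filtering rule are assumptions, not part of this post), the extraction step typically streams through the source record by record, filtering and transforming as it goes, rather than loading the entire resource at once:

```python
import csv

def extract_glucose_values(path):
    """Stream through a large tab-delimited file, keeping only the records of interest."""
    with open(path, newline="") as source:
        for row in csv.DictReader(source, delimiter="\t"):
            if row.get("descriptor") == "serum_glucose":   # filter step
                yield float(row["value"])                   # transform step

# Later steps (reduction, normalization, visualization, re-analysis) can then
# proceed incrementally, without holding the whole resource in memory, e.g.:
#     total = sum(extract_glucose_values("events.tsv"))
```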

If you read Principles of Big Data: Preparing, Sharing, and Analyzing Complex Information, you will find that the gulf between massive data and Big Data is profound; the two subjects can seldom be discussed productively within the same venue.

- Jules Berman

key words: Big Data, lotsa data, massive data, data analysis, data analyses, large-scale data, Big Science, simple data, little science, little data, small data, data preparation, data analysis, data analyst




Saturday, June 1, 2013

Differences between Big Data and Small Data


Big Data is very different from small data.  Here are some of the important features that distinguish one from the other.

1. Goals

  • small data-Usually designed to answer a specific question or serve a particular goal. 
  • Big Data-Usually designed with a goal in mind, but the goal is flexible and the questions posed are protean. 


2. Location

  • small data-Typically, small data is contained within one institution, often on one computer, sometimes in one file. 
  • Big Data-Typically spread throughout electronic space, typically parceled onto multiple Internet servers, located anywhere on earth. 


3. Data structure and content

  • small data-Ordinarily contains highly structured data. The data domain is restricted to a single discipline or subdiscipline. The data often comes in the form of uniform records in an ordered spreadsheet. 
  • Big Data-Must be capable of absorbing unstructured data (e.g., free-text documents, images, motion pictures, sound recordings, and physical objects). The subject matter of the resource may cross multiple disciplines, and the individual data objects in the resource may link to data contained in other, seemingly unrelated, Big Data resources.


4. Data preparation

  • small data-In many cases, the data user prepares her own data, for her own purposes. 
  • Big Data-The data comes from many diverse sources, and it is prepared by many people. People who use the data are seldom the people who have prepared the data. 


5. Longevity

  • small data-When the data project ends, the data is kept for a limited time and then discarded. 
  • Big Data-Big Data projects typically contain data that must be stored in perpetuity. Ideally, data stored in a Big Data resource will be absorbed into another resource when the original resource terminates. Many Big Data projects extend into the future and the past (e.g., legacy data), accruing data prospectively and retrospectively. 


6. Measurements

  • small data-Typically, the data is measured using one experimental protocol, and the data can be represented using one set of standard units (see Glossary item, Protocol). 
  • Big Data-Many different types of data are delivered in many different electronic formats. Measurements, when present, may be obtained by many different protocols. Verifying the quality of Big Data is one of the most difficult tasks for data managers.


7. Reproducibility

  • small data-Projects are typically repeatable. If there is some question about the quality of the data, reproducibility of the data, or validity of the conclusions drawn from the data, the entire project can be repeated, yielding a new data set. 
  • Big Data-Replication of a Big Data project is seldom feasible. In most instances, all that anyone can hope for is that bad data in a Big Data resource will be found and flagged as such. 


8. Stakes

  • small data-Project costs are limited. Laboratories and institutions can usually recover from the occasional small data failure. 
  • Big Data-Big Data projects can be obscenely expensive. A failed Big Data effort can lead to bankruptcy, institutional collapse, mass firings, and the sudden disintegration of all the data held in the resource. 


9. Introspection

  • small data-Individual data points are identified by their row and column location within a spreadsheet or database table (see Glossary item, Data point). If you know the row and column headers, you can find and specify all of the data points contained within. 
  • Big Data-Unless the Big Data resource is exceptionally well designed, the contents and organization of the resource can be inscrutable, even to the data managers (see Glossary item, Data manager). Complete access to data, information about the data values, and information about the organization of the data is achieved through a technique herein referred to as introspection (see Glossary item, Introspection). 


10. Analysis

  • small data-In most instances, all of the data contained in the data project can be analyzed together, and all at once. 
  • Big Data-With few exceptions, such as those conducted on supercomputers or in parallel on multiple computers, Big Data is ordinarily analyzed in incremental steps (see Glossary items, Parallel computing, MapReduce). The data are extracted, reviewed, reduced, normalized, transformed, visualized, interpreted, and reanalyzed with different methods.
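To make the last point a bit more concrete, here is a minimal sketch of the map-and-reduce pattern (my own toy example, not drawn from the book or from any particular framework): each chunk of data is summarized independently, and the partial summaries are then combined.

```python
from collections import Counter
from functools import reduce

# Toy chunks standing in for pieces of a much larger resource; each chunk is a
# list of (descriptor, value) records.
chunks = [
    [("serum_glucose", 100), ("body_weight", 70)],
    [("serum_glucose", 115), ("serum_glucose", 98)],
]

def map_chunk(chunk):
    """Map step: count records per descriptor within a single chunk."""
    return Counter(descriptor for descriptor, _ in chunk)

def merge_counts(first, second):
    """Reduce step: combine two partial counts."""
    return first + second

partial_counts = [map_chunk(chunk) for chunk in chunks]   # the map steps could run in parallel
total = reduce(merge_counts, partial_counts, Counter())
print(total)   # Counter({'serum_glucose': 3, 'body_weight': 1})
```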

Next blog post in series


tags: Big Data, lotsa data, massive data, data analysis, data analyses, large-scale data, Big Science, simple data, little science, little data, small data, data preparation, data analysis, data analyst

