Wednesday, June 5, 2013

Toward Big Data Immutability


Today's blog continues yesterday's discussion of Big Data Immutability.

Big Data managers must do what seems to be impossible; they must learn how to modify data without altering the original content.  The trick is accomplished with identifiers and time-stamps attached to event data (and yes, it's all discussed at greater length in my book, Principles of Big Data: Preparing, Sharing, and Analyzing Complex Information).

In today's blog, let's just focus on the concept of a time-stamp. Temporal events must be given a time-stamp indicating the time that the event occurred, using a standard measurement for time. The time-stamp must be accurate, persistent, and immutable.

Time-stamps are not tamper-proof. In many instances, changing a recorded time residing in a file or data set requires nothing more than viewing the data on your computer screen and substituting one date and time for another. Dates that are automatically recorded by your computer system can also be altered, because operating systems permit users to reset the system date and time. Because the timing of events can be altered, scrupulous data managers employ a trusted time-stamp protocol by which a time-stamp can be verified.

Here is a description of how a trusted time-stamp protocol might work. You have just created a message, and you need to document that the message existed on the current date. You compute a one-way hash of the message (a fixed-length sequence of seemingly random alphanumeric characters). You send the one-way hash sequence to your city's newspaper, with instructions to publish the sequence in the classified section of that day's late edition. You're done. Anyone questioning whether the message really existed on that particular date can perform their own one-way hash on the message and compare the resulting sequence with the sequence that was published in the city newspaper on that date. If the message is unaltered, the two sequences will be identical.
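For readers who like to see the idea in code, here is a minimal sketch of the hashing and checking steps in Python. It assumes SHA-256 as the one-way hash (the protocol described above does not name a particular algorithm), and the function names one_way_hash and matches_published_hash are made up for illustration. The newspaper itself is not modeled; the published sequence is simply a string that you hold on to.

```python
import hashlib
import hmac


def one_way_hash(message: bytes) -> str:
    """Return a fixed-length hexadecimal digest of the message.

    SHA-256 is an assumption here; any well-regarded one-way hash would do.
    """
    return hashlib.sha256(message).hexdigest()


def matches_published_hash(message: bytes, published_hash: str) -> bool:
    """Recompute the hash and compare it with the published sequence."""
    recomputed = one_way_hash(message)
    # compare_digest does a constant-time comparison of the two strings.
    return hmac.compare_digest(recomputed, published_hash.strip().lower())


if __name__ == "__main__":
    message = b"Shipment 42 was received at the loading dock."
    published = one_way_hash(message)  # the sequence sent to the newspaper

    # Later, a skeptic recomputes the hash from the message they were given.
    print(matches_published_hash(message, published))
    print(matches_published_hash(message + b" (edited)", published))
```

The first check should succeed and the second should fail, because changing the message in any way produces a different sequence. Note that the hash only shows that the message matches the published sequence; the date comes from the trusted publication, not from the hash.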

Today, newspapers are seldom used in trusted time-stamp protocols. Cautious Big Data managers employ trusted time authorities and encrypted time values to create authenticated and verifiable time-stamp data. It's all done quickly and transparently, and you end up with event data (log-ins, transactions, quantities received, observations, etc.) that are associated with an identifier, a time, and a descriptor (e.g., a tag that explains the data). When new events occur, they can be added to a data object containing related event data. The idea behind all this activity is that old data need never be replaced by new data. Your data object will always contain the information needed to distinguish one event from another, so that you can choose the event data that are appropriate to your query or your analysis.
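Here is one way the "add, never replace" idea might look in Python. This is only a sketch: the class names Event and DataObject, and the method value_as_of, are hypothetical and chosen for illustration. The time-stamps come from the local system clock in UTC, which is not a trusted time authority, so treat the sketch as a picture of the data structure rather than of time-stamp verification.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Any, List, Optional, Tuple


@dataclass(frozen=True)
class Event:
    """One immutable event: identifier, time, descriptor, and the datum."""
    identifier: str
    timestamp: datetime
    descriptor: str
    value: Any


class DataObject:
    """Append-only collection of events that share one identifier."""

    def __init__(self, identifier: str) -> None:
        self.identifier = identifier
        self._events: List[Event] = []

    def add_event(self, descriptor: str, value: Any,
                  timestamp: Optional[datetime] = None) -> Event:
        # Default to the current UTC time (an untrusted, local clock).
        when = timestamp or datetime.now(timezone.utc)
        event = Event(self.identifier, when, descriptor, value)
        self._events.append(event)  # old events are never overwritten
        return event

    def events(self) -> Tuple[Event, ...]:
        """Return a read-only snapshot of every recorded event."""
        return tuple(self._events)

    def value_as_of(self, descriptor: str, moment: datetime) -> Any:
        """Return the most recent value for a descriptor at a given moment."""
        candidates = [e for e in self._events
                      if e.descriptor == descriptor and e.timestamp <= moment]
        if not candidates:
            return None
        return max(candidates, key=lambda e: e.timestamp).value


if __name__ == "__main__":
    item = DataObject("warehouse-item-0001")
    item.add_event("quantity_received", 100,
                   datetime(2013, 6, 1, tzinfo=timezone.utc))
    item.add_event("quantity_received", 250,
                   datetime(2013, 6, 4, tzinfo=timezone.utc))
    print(item.value_as_of("quantity_received",
                           datetime(2013, 6, 2, tzinfo=timezone.utc)))
```

Both quantities stay in the data object. A query about June 2 picks out the earlier event, and a query about June 5 would pick out the later one; nothing is overwritten along the way.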

-Jules Berman

key words: Big Data, mutable, mutability, data persistence, time stamp, time stamping, encrypted time stamp, data object, time-stamping an event, archiving, dystopia, George Orwell, newspeak, persistence, persistent data, saving data, time-stamp
