Excerpt from Principles of Big Data: Preparing, Sharing, and Analyzing Complex Information, by Jules J. Berman (see yesterday's blog).
1. Goals
- small data: Usually designed to answer a specific question or serve a particular goal.
- Big Data: Usually designed with a goal in mind, but the goal is flexible and the questions posed are protean.
2. Location
- small data: Typically, small data is contained within one institution, often on one computer, sometimes in one file.
- Big Data: Typically spread throughout electronic space, parceled onto multiple Internet servers that may be located anywhere on earth.
3. Data structure and content
- small data: Ordinarily contains highly structured data. The data domain is restricted to a single discipline or subdiscipline. The data often comes in the form of uniform records in an ordered spreadsheet.
- Big Data: Must be capable of absorbing unstructured data (e.g., free-text documents, images, motion pictures, sound recordings, physical objects). The subject matter of the resource may cross multiple disciplines, and the individual data objects in the resource may link to data contained in other, seemingly unrelated, Big Data resources.
4. Data preparation
- small data: In many cases, the data user prepares her own data for her own purposes.
- Big Data: The data comes from many diverse sources, and it is prepared by many people. The people who use the data are seldom the people who prepared it.
5. Longevity
- small data: When the data project ends, the data is kept for a limited time and then discarded.
- Big Data: Big Data projects typically contain data that must be stored in perpetuity. Ideally, data stored in a Big Data resource will be absorbed into another resource when the original resource terminates. Many Big Data projects extend into the future and the past (e.g., legacy data), accruing data prospectively and retrospectively.
6. Measurements
- small data: Typically, the data is measured using one experimental protocol, and the data can be represented using one set of standard units (see Glossary item, Protocol).
- Big Data: Many different types of data are delivered in many different electronic formats. Measurements, when present, may be obtained by many different protocols. Verifying the quality of Big Data is one of the most difficult tasks for data managers.
7. Reproducibility
- small data: Projects are typically repeatable. If there is some question about the quality of the data, reproducibility of the data, or validity of the conclusions drawn from the data, the entire project can be repeated, yielding a new data set.
- Big Data: Replication of a Big Data project is seldom feasible. In most instances, all that anyone can hope for is that bad data in a Big Data resource will be found and flagged as such.
8. Stakes
- small data: Project costs are limited. Laboratories and institutions can usually recover from the occasional small data failure.
- Big Data: Big Data projects can be obscenely expensive. A failed Big Data effort can lead to bankruptcy, institutional collapse, mass firings, and the sudden disintegration of all the data held in the resource.
9. Introspection
- small data: Individual data points are identified by their row and column location within a spreadsheet or database table (see Glossary item, Data point). If you know the row and column headers, you can find and specify all of the data points contained within.
- Big Data: Unless the Big Data resource is exceptionally well designed, the contents and organization of the resource can be inscrutable, even to the data managers (see Glossary item, Data manager). Complete access to the data, to information about the data values, and to information about the organization of the data is achieved through a technique referred to here as introspection (see Glossary item, Introspection); a minimal sketch of the idea appears after this list.
10. Analysis
- small data: In most instances, all of the data contained in the data project can be analyzed together, and all at once.
- Big Data: With few exceptions, such as analyses conducted on supercomputers or in parallel on multiple computers, Big Data is ordinarily analyzed in incremental steps (see Glossary items, Parallel computing, MapReduce). The data are extracted, reviewed, reduced, normalized, transformed, visualized, interpreted, and reanalyzed with different methods; a MapReduce-style sketch of this incremental approach also appears after this list.
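To make the idea of introspection (item 9) concrete, here is a minimal Python sketch, not taken from the book: each data object carries its own identifier and descriptive metadata, so whoever receives the object can ask it to describe itself. The class name, field names, and example values are hypothetical and chosen only for illustration.

```python
import uuid
import json

class DataObject:
    """A self-describing data object: it carries an identifier and
    metadata so its contents and organization can be queried later.
    (Hypothetical class, for illustration only.)"""

    def __init__(self, value, **metadata):
        self.identifier = str(uuid.uuid4())   # unique, permanent identifier
        self.value = value                    # the measurement or payload itself
        self.metadata = metadata              # units, protocol, provenance, etc.

    def introspect(self):
        """Return a description of the object: what it is, what it holds,
        and how it is organized."""
        return {
            "identifier": self.identifier,
            "class": type(self).__name__,
            "value": self.value,
            "metadata": self.metadata,
        }

# Hypothetical usage: a stored measurement describes itself on request.
temperature = DataObject(37.2, units="Celsius", protocol="oral thermometer",
                         source="clinic_A", recorded="2013-06-01")
print(json.dumps(temperature.introspect(), indent=2))
```

The particular class design is beside the point; the habit worth noticing is that every data value is bundled with an identifier and metadata, so the resource can answer questions about its own contents.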
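And here is a minimal sketch of the incremental map-then-reduce pattern mentioned in item 10, written in plain Python rather than any particular Big Data framework: the data are processed chunk by chunk, each chunk is reduced to a small intermediate summary, and the summaries are merged at the end. The chunk size, the word-count task, and the function names are assumptions made for illustration.

```python
from collections import Counter
from functools import reduce

def map_chunk(lines):
    """Map step: reduce one chunk of raw text lines to a small summary
    (here, a word-frequency count)."""
    counts = Counter()
    for line in lines:
        counts.update(line.lower().split())
    return counts

def reduce_counts(total, partial):
    """Reduce step: merge one chunk's summary into the running total."""
    total.update(partial)
    return total

def chunks(iterable, size):
    """Yield successive fixed-size chunks so that only `size` records
    need to be held in memory at once."""
    chunk = []
    for item in iterable:
        chunk.append(item)
        if len(chunk) == size:
            yield chunk
            chunk = []
    if chunk:
        yield chunk

# Hypothetical usage: stream a large text source in 1000-line chunks.
corpus = ["big data is analyzed in incremental steps"] * 5000
partial_summaries = (map_chunk(c) for c in chunks(corpus, 1000))
word_counts = reduce(reduce_counts, partial_summaries, Counter())
print(word_counts.most_common(3))
```

In a real Big Data setting the chunks would live on different servers and the map step would run in parallel, but the shape of the computation, partial summaries combined into a whole, is the same.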
Next blog post in series
tags: Big Data, lotsa data, massive data, data analysis, data analyses, large-scale data, Big Science, simple data, little science, little data, small data, data preparation, data analyst
Science is not a collection of facts. Science is what facts teach us; what we can learn about our universe, and ourselves, by deductive thinking. From observations of the night sky, made without the aid of telescopes, we can deduce that the universe is expanding, that the universe is not infinitely old, and why black holes exist. Without resorting to experimentation or mathematical analysis, we can deduce that gravity is a curvature in space-time, that the particles that compose light have no mass, that there is a theoretical limit to the number of different elements in the universe, and that the earth is billions of years old. Likewise, simple observations on animals tell us much about the migration of continents, the evolutionary relationships among classes of animals, why the nuclei of cells contain our genetic material, why certain animals are long-lived, why the gestation period of humans is 9 months, and why some diseases are rare and other diseases are common. In “Armchair Science”, the reader is confronted with 129 scientific mysteries, in cosmology, particle physics, chemistry, biology, and medicine. Beginning with simple observations, step-by-step analyses guide the reader toward solutions that are sometimes startling, and always entertaining. “Armchair Science” is written for general readers who are curious about science, and who want to sharpen their deductive skills.