Tuesday, February 2, 2016

When Reviewing Sets of Data, Always Examine the Range

After you have had a chance to look at the data, it is prudent to determine the highest and the lowest observed values in your data collection (i.e., the range of the data). These two numbers are often the most important numbers in any set of data, even more important than the average or the standard deviation. Where the data begins and ends tells the data scientist a great deal about the intrinsic meaning of the data. Moreover, your data must fit within the range of the device that produced the data measurements. Most devices have a range over which they can detect data fairly accurately, the so-called dynamic range (see Glossary item, Accuracy versus precision). Below that range, they might register the measurement as zero, or as some fixed minimum value, or as some random value (i.e., noise). Above the range, the instrument might register a fixed maximum value, or some number larger than the maximum (i.e., more noise). Ideally, all of the data elements in your collection will fall well within the dynamic range of the measurement instrument. In any case, it is vital to know the range of the measured data and the dynamic range of the measurement instrument. Data values higher than or lower than the dynamic range do not contain useful information.
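
As a rough illustration of this first check, the sketch below reports the observed range and counts how many values sit at or beyond the limits of the instrument. The function name `summarize_range`, the sample readings, and the 5-to-300-pound dynamic range are all assumptions made for the example, not values taken from any real device.

```python
def summarize_range(values, instrument_min, instrument_max):
    """Report the observed range and count values at or beyond the instrument limits."""
    return {
        "observed_min": min(values),
        "observed_max": max(values),
        # Values at or below the floor may be zeros, fixed minimums, or noise.
        "at_or_below_floor": sum(1 for v in values if v <= instrument_min),
        # Values at or above the ceiling may be clipped or noisy.
        "at_or_above_ceiling": sum(1 for v in values if v >= instrument_max),
    }

# Hypothetical readings from a scale assumed reliable between 5 and 300 pounds.
readings = [0, 142, 187, 203, 251, 300, 300, 300]
report = summarize_range(readings, instrument_min=5, instrument_max=300)
print(report)
```

Any nonzero count at either limit is a prompt to look more closely at those records before modeling them.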

It really is not unusual for otherwise intelligent data scientists to develop sophisticated data models for totally spurious measurements that lie outside the dynamic range of their instruments (see Glossary item, Data modeling). Here is an example. You are looking at human subject data that includes weights. You find that the maximum weight in the data set is 300 pounds, exactly. There are many individuals in the data set who have a weight of 300 pounds, but no individuals with a weight exceeding 300 pounds. You also find that the number of individuals weighing 300 pounds is much greater than the number of individuals weighing 290 pounds. What does this tell you? Obviously, the people included in the data set have been weighed on a scale that tops out at 300 pounds. Most of the people whose weight was recorded as 300 will have a false weight measurement. Had we not looked for the maximum value in the data set, we would have assumed, incorrectly, that the weights were valid [1].
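
The pile-up described in this example can be looked for mechanically. The following is a minimal sketch, assuming a hypothetical list of recorded weights and a 5-pound comparison bin; the function name `ceiling_pileup` and the rule "more records at the maximum than in the whole bin just below it" are illustrative choices, not a standard test.

```python
def ceiling_pileup(values, bin_width=5):
    """Compare the count at the maximum value with the count in the bin just below it."""
    top = max(values)
    at_top = sum(1 for v in values if v == top)
    just_below = sum(1 for v in values if top - bin_width <= v < top)
    return top, at_top, just_below

# Hypothetical recorded weights, in pounds.
weights = [180, 210, 245, 270, 290, 296, 298, 300, 300, 300, 300, 300, 300]
top, at_top, just_below = ceiling_pileup(weights)

# A single value holding more records than the entire bin beneath it
# suggests that the instrument clipped everything above that value.
suspicious = at_top > just_below
print(top, at_top, just_below, suspicious)
```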

It might be useful to get some idea of how weights are distributed in the population exceeding 300 pounds (i.e., the population outside the dynamic range of the scale). One way of estimating the error is to look at the number of people weighing 295 pounds, 290 pounds, 285 pounds, etc. By observing the trend, and knowing the total number of individuals who weigh at least 300 pounds, you can estimate the number of people falling into the weight categories exceeding 300 pounds.
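
One way to turn that trend into numbers is sketched below. It assumes, purely for illustration, that each 5-pound bin holds a constant fraction of the bin before it (a geometric decline); the text does not specify the form of the trend, and a real data set may call for a different one. The function name `estimate_tail` and the sample counts are hypothetical.

```python
def estimate_tail(counts_below, total_at_ceiling, n_bins=5):
    """Spread the people recorded at the ceiling across bins above it.

    counts_below: counts for successive bins just under the ceiling,
    in ascending weight order (e.g., the 285, 290, and 295 pound bins).
    Assumption: each bin holds a constant fraction r of the bin before it.
    """
    ratios = [b / a for a, b in zip(counts_below, counts_below[1:])]
    r = sum(ratios) / len(ratios)
    if not 0 < r < 1:
        raise ValueError("counts must be declining for this sketch to apply")
    # The shares (1 - r) * r**k sum to 1 over k = 0, 1, 2, ...,
    # so the estimates never exceed the known total at the ceiling.
    return r, [total_at_ceiling * (1 - r) * r**k for k in range(n_bins)]

# Hypothetical counts: 80, 40, and 20 people in the three bins below the
# ceiling, and 32 people recorded at exactly 300 pounds.
r, tail = estimate_tail([80, 40, 20], total_at_ceiling=32, n_bins=4)
print(r, tail)
```

Here the four estimated bins account for 30 of the 32 people recorded at the ceiling; the remainder is assigned to weights beyond the last bin shown.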

Here is another example where knowing the maximum for a data set measurement is useful. You are looking at a collection of data on meteorites. The measurements include weights. You notice that the largest meteorite in the large collection weighs 66 tons (equivalent to about 60,000 kilograms), and has a diameter of about 3 meters. Small meteorites are more numerous than large meteorites, but almost every weight category is accounted for by one or more meteorites, up to 66 tons. After that, nothing. You check the published data on meteorites and find that none of your colleagues have reported finding meteorites weighing in excess of about 66 tons. Why do meteorites have a maximum size of about 66 tons (see Glossary items, Meta-analysis, Missing values)?

A little checking tells you that meteors in space can come in just about any size, from a speck of dust to a moon-sized rock. Collisions with earth have involved meteorites much larger than 3 meters. You check the astronomical records and you find that the meteor that may have caused the extinction of large dinosaurs about 65 million years ago was estimated at 6 to 10 kilometers in diameter (at least 2000 times the diameter of the largest meteorite found on earth).

There is a very simple reason why the largest meteorite found on earth weighs about 66 tons, while the largest meteorites to impact the earth are known to be thousands of times heavier. When meteorites exceed 66 tons, the impact energy can exceed the energy produced by an atom bomb blast. Meteorites larger than 66 tons leave an impact crater, but the meteor itself disintegrates on impact [1].

As it turns out, much is known about meteorite impacts. The kinetic energy of the impact is proportional to the mass of the meteor and to the square of its velocity. The minimum velocity of a meteor at impact is about 11 km/second (equivalent to the minimum escape velocity for sending an object from earth into space). The fastest impacts occur at about 70 km per second. From these data, the energy released by meteors, on impact with the earth, can be easily calculated.
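
The calculation can be sketched with the standard kinetic energy formula, one half of the mass times the square of the velocity. The mass and the two speeds come from the text; the function name `impact_energy_joules` and the conversion to kilotons of TNT (using the conventional 4.184e12 joules per kiloton) are additions made for illustration.

```python
JOULES_PER_KILOTON_TNT = 4.184e12  # conventional definition of a kiloton of TNT

def impact_energy_joules(mass_kg, speed_km_per_s):
    """Kinetic energy, E = 1/2 * m * v**2, with the speed converted to m/s."""
    speed_m_per_s = speed_km_per_s * 1000.0
    return 0.5 * mass_kg * speed_m_per_s ** 2

mass_kg = 60_000  # about 66 tons, the largest meteorite in the collection
slow = impact_energy_joules(mass_kg, 11)  # slowest impacts, about 11 km/s
fast = impact_energy_joules(mass_kg, 70)  # fastest impacts, about 70 km/s

for label, energy in (("11 km/s", slow), ("70 km/s", fast)):
    print(f"{label}: {energy:.2e} J, {energy / JOULES_PER_KILOTON_TNT:.2f} kt TNT")
```

For a 60,000 kilogram body, this gives about 3.6e12 joules (a little under one kiloton of TNT) at 11 km/second and about 1.5e14 joules (roughly 35 kilotons) at 70 km/second.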

By observing the maximum weight of meteors found on earth we learn a great deal about meteoric impacts. When we look at the distribution of weights, we can see that small meteorites are more numerous than larger meteorites. If we develop a simple formula that relates the size of a meteorite to its frequency of occurrence, we can predict the likelihood of the arrival of a meteorite on earth, for every weight of meteorite, including those weighing more than 66 tons, over any interval of time.
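
One candidate for such a formula is a power law, in which frequency falls off as a fixed power of mass. That form is an assumption made here for illustration; the text does not say which formula to use. The sketch below fits a straight line to the logarithms of mass and count by ordinary least squares and then extrapolates beyond the observed range. The counts are invented so that the fit is exact, and the names `fit_power_law` and `predicted_count` are hypothetical.

```python
import math

def fit_power_law(masses, counts):
    """Least-squares line through (log10 mass, log10 count).

    Returns (slope, intercept), so that count is about 10**intercept * mass**slope.
    """
    xs = [math.log10(m) for m in masses]
    ys = [math.log10(c) for c in counts]
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def predicted_count(mass, slope, intercept):
    """Expected count at a given mass under the fitted power law."""
    return 10 ** (intercept + slope * math.log10(mass))

# Invented counts per mass category, chosen to follow an exact power law.
masses_kg = [1, 10, 100, 1000]
counts = [1000, 100, 10, 1]
slope, intercept = fit_power_law(masses_kg, counts)

# Extrapolate to a mass above the 60,000 kg maximum seen in the collection.
print(slope, intercept, predicted_count(100_000, slope, intercept))
```

Counts gathered over a known period can be divided by that period to give a rate, which is what a prediction over an interval of time would need. Extrapolating past the largest observed mass assumes the same relationship holds there, which the data alone cannot confirm.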

[1] Berman JJ. Principles of Big Data: Preparing, Sharing, and Analyzing Complex Information. Morgan Kaufmann, Waltham, MA, 2013.

- Jules Berman (copyrighted material)

key words: range, dynamic range, maxima, minima, maximum, minimum, data analysis, data science, data simplification, jules j berman