Wednesday, January 13, 2016


Here is a short excerpt from my book, Principles of Big Data: Preparing, Sharing, and Analyzing Complex Information.


The human brain is constantly processing visual and other sensory information collected from the environment. When we walk down the street, we see images of concrete and asphalt and millions of blades of grass, and birds, and dogs, and other persons, and so on. Every step we take conveys a new world of sensory input. How can we process it all? The mathematician and philosopher Karl Pearson (1857-1936) likened the human mind to a "sorting machine". We take a stream of sensory information and sort it into objects; we then collect the individual objects into general classes. The green stuff on the ground is classified as "grass," and the grass is subclassified under some larger grouping, such as "plants." A flat stretch of asphalt and concrete may be classified as a "road" and the road might be subclassified under "man-made constructions". If we lacked a culturally determined classification of objects for our world, we would be overwhelmed by sensory input, and we would have no way to remember what we see, and no way to draw general inferences about anything. Simply put, without our ability to classify, we would not be human.

Every culture imposes its own uniform way of perceiving the environment. In English-speaking cultures, the term "hat" denotes a universally recognized object. Hats may be composed of many different types of materials, and they may vary greatly in size, weight, and shape. Nonetheless, we can almost always identify a hat when we see one, and we can distinguish a hat from all other types of objects. An object is not classified as a hat simply because it shares a few structural similarities with other hats. A hat is classified as a hat because it has a class relationship; all hats are items of clothing that fit over the head. Likewise, all biological classifications are built by relationships, not by similarities.

Aristotle was one of the first experts in classification. His greatest insight came when he correctly identified a dolphin as a mammal. Through observation, he knew that a large group of animals was distinguished by a gestational period in which a developing embryo is nourished by a placenta, the offspring are delivered into the world as formed but small versions of the adult animals (i.e., not as eggs or larvae), and the newborn animals feed on milk secreted from nipples overlying specialized glandular organs (mammae). Aristotle knew that these features, characteristic of mammals, were absent in all other types of animals. He also knew that dolphins had all these features; fish did not. He correctly reasoned that dolphins were a type of mammal, not a type of fish. Aristotle was ridiculed by his contemporaries, for whom it was obvious that dolphins were a type of fish. Unlike Aristotle, they based their classification on similarities, not on relationships. They saw that dolphins looked like fish and dolphins swam in the ocean like fish, and this was all the proof they needed to conclude that dolphins were indeed fish. For about two thousand years following the death of Aristotle, biologists persisted in their belief that dolphins were a type of fish. For the past several hundred years, biologists have acknowledged that Aristotle was correct after all; dolphins are mammals. Aristotle discovered and taught the most important principle of classification: classes are built on relationships among class members, not by counting similarities. We will see in later chapters that methods of grouping data objects by similarity can be very misleading and should not be used as the basis for constructing a classification or an ontology.

A classification is a very simple form of ontology, in which each class is limited to one parent class. To build a classification, the ontologist must do the following: 1) define classes (i.e., find the properties that define a class and extend to the subclasses of the class); 2) assign instances to classes; 3) position classes within the hierarchy; and 4) test and validate all the above.

The constructed classification becomes a hierarchy of data objects conforming to a set of principles:

1. The classes (groups with members) of the hierarchy have a set of properties or rules that extend to every member of the class and to all of the subclasses of the class, to the exclusion of unrelated classes. A subclass is itself a type of class wherein the members have the defining class properties of the parent class plus some additional property(ies) specific to the subclass.

2. In a hierarchical classification, each subclass may have no more than one parent class. The root (top) class has no parent class. The biological classification of living organisms is a hierarchical classification.

3. At the bottom of the hierarchy is the class instance. For example, your copy of this book is an instance of the class of objects known as "books".

4. Every instance belongs to exactly one class.

5. Instances and classes do not change their positions in the classification. As examples, a horse never transforms into a sheep, and a book never transforms into a harpsichord.

6. The members of classes may be highly similar to one another, but their similarities result from their membership in the same class (i.e., conforming to class properties), and not the other way around (i.e., similarity alone cannot define class inclusion).
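Some of these principles can be checked mechanically. The following is a minimal sketch, not a method from the book, assuming the classification is stored as a Python dictionary mapping each class to its one parent, and the instances as a second dictionary mapping each instance to its one class. The names `parent_of`, `class_of_instance`, and `check_classification`, and the toy data, are made up for illustration. Because a dictionary key can hold only one value, principles 2 and 4 (one parent per class, one class per instance) are enforced by the data structure itself; the function checks that there is a single root, that no class is its own ancestor, and that every instance is assigned to a known class.

```python
def check_classification(parent_of, class_of_instance):
    """Return a list of problems found; an empty list means the checks passed."""
    problems = []
    classes = set(parent_of) | set(parent_of.values())

    # The root is the only class that has no parent (principle 2).
    roots = classes - set(parent_of)
    if len(roots) != 1:
        problems.append(f"expected one root class, found {sorted(roots)}")

    # Ascending from any class must never revisit a class (no cycles).
    for start in parent_of:
        seen = {start}
        current = start
        while current in parent_of:
            current = parent_of[current]
            if current in seen:
                problems.append(f"cycle through {start}")
                break
            seen.add(current)

    # Every instance must sit in a class that exists in the hierarchy.
    for instance, cls in class_of_instance.items():
        if cls not in classes:
            problems.append(f"instance {instance} assigned to unknown class {cls}")
    return problems


# Toy data, invented for this example.
parent_of = {
    "grass": "plants",
    "plants": "living organisms",
    "animals": "living organisms",
}
class_of_instance = {"blade of grass #1": "grass"}

print(check_classification(parent_of, class_of_instance))
```

Principle 6 is deliberately not checked: similarity among members is a consequence of class membership, so there is nothing to test by comparing features.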

Classifications are always simple; the parental classes of any instance of the classification can be traced as a simple, non-branched list, ascending through the class hierarchy. As an example, here is the lineage for the domestic horse (Equus caballus), from the classification of living organisms:

Equus caballus
Equus subg. Equus
Fungi/Metazoa group
cellular organisms
The words in this zoologic lineage may seem strange to laypersons, but taxonomists who view this lineage instantly grasp the place of domestic horses in the classification of all living organisms.

A classification is a list of every member class, along with their relationships to other classes. Because each class can have only one parent class, a complete classification can be provided when we list all the classes, adding the name of the parent class for each class on the list. For example, a few lines of the classification of living organisms might be:
Craniata, subclass of Chordata
Chordata, subclass of Deuterostomia
Deuterostomia, subclass of Coelomata
Coelomata, subclass of Bilateria
Bilateria, subclass of Eumetazoa
Given the name of any class, a programmer can compute, with a few lines of code, the complete ancestral lineage for the class by iteratively finding the parent class assigned to each ascending class.
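As a minimal sketch of that computation, assume the classification is held in a Python dictionary that maps each class to its parent. The names `parent_of` and `lineage` are illustrative and not from the book. Only the five class-parent pairs listed above are included, so the ascent stops at Eumetazoa rather than at the root of the full classification.

```python
# Child -> parent pairs taken from the five lines listed above.
parent_of = {
    "Craniata": "Chordata",
    "Chordata": "Deuterostomia",
    "Deuterostomia": "Coelomata",
    "Coelomata": "Bilateria",
    "Bilateria": "Eumetazoa",
}


def lineage(class_name, parent_of):
    """Return class_name followed by each ancestor, ascending the hierarchy."""
    result = [class_name]
    seen = {class_name}
    # Keep looking up the parent of the most recent class until none is listed.
    while result[-1] in parent_of:
        parent = parent_of[result[-1]]
        if parent in seen:
            # A valid classification has no cycles; guard against bad data.
            raise ValueError(f"cycle detected at {parent!r}")
        result.append(parent)
        seen.add(parent)
    return result


print(" > ".join(lineage("Craniata", parent_of)))
```

Because each class has exactly one parent, the loop never has to choose between branches, which is why the lineage is always a simple, non-branched list.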

A taxonomy is a classification with the instances "filled in." This means that for each class in a taxonomy, all the known instances (i.e., member objects) are explicitly listed. For the taxonomy of living organisms, the instances are named species. Currently, there are several million named species of living organisms, and each of these several million species is listed under the name of some class included in the full classification.

Classifications drive down the complexity of their data domain, because every instance in the domain is assigned to a single class, and every class is related to the other classes through a simple hierarchy.

It is important to distinguish a classification system from an identification system. An identification system puts a data object into its correct slot within the classification. For example, a fingerprint matching system may look for a set of features that puts a fingerprint into a special subclass of all fingerprints, but the primary goal of fingerprint matching is to establish the identity of an instance (i.e., to show that two sets of fingerprints belong to the same person). In the realm of medicine, when a doctor renders a diagnosis of a patient's disease, she is not classifying the disease; she is finding the correct slot within the pre-existing classification of diseases that holds her patient's diagnosis.

-Jules Berman (Copyrighted material)

key words: classification, ontology, taxonomy, jules j berman