Saturday, February 3, 2018

Paradoxes of Classification (and terrible Class definitions)

The formal systems that assign data objects to classes, and that relate classes to other classes, are known as ontologies. When the data within a Big Data resource is classified within an ontology, data analysts can determine whether observations on a single object will apply to other objects in the same class. Similarly, data analysts can begin to ask whether observations that hold true for a class of objects will relate to other classes of objects. Basically, ontologies help scientists fulfill one of their most important tasks: determining how things relate to other things.

A classification is a very simple form of ontology, in which each class is allowed to have only one parent class. To build a classification, the ontologist must do the following: 1) define classes (i.e., find the properties that define a class and extend to the subclasses of the class); 2) assign instances to classes; 3) position classes within the hierarchy; and 4) test and validate all the above.

The constructed classification becomes a hierarchy of data objects conforming to a set of principles:

  • The classes (groups with members) of the hierarchy have a set of properties or rules that extend to every member of the class and to all of the subclasses of the class, to the exclusion of unrelated classes. A subclass is itself a type of class wherein the members have the defining class properties of the parent class plus one or more additional properties specific to the subclass.

  • In a hierarchical classification, each subclass may have no more than one parent class. The root (top) class has no parent class. The biological classification of living organisms is a hierarchical classification.
  • At the bottom of the hierarchy is the class instance. For example, your copy of this book is an instance of the class of objects known as "books".
  • Every instance belongs to exactly one class.
  • Instances and classes do not change their positions in the classification. As examples, a horse never transforms into a sheep, and a book never transforms into a harpsichord.
  • The members of classes may be highly similar to one another, but their similarities result from their membership in the same class (i.e., conforming to class properties), and not the other way around (i.e., similarity alone cannot define class inclusion).
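The principles listed above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the names `ClassNode`, `Classification`, `assign`, and `class_of` are hypothetical and do not come from any particular ontology library, and the "objects" root class is invented for the example.

```python
class ClassNode:
    """A class in a classification: at most one parent, fixed at creation."""

    def __init__(self, name, parent=None):
        self.name = name
        self._parent = parent  # None only for the root class

    @property
    def parent(self):
        # Read-only: a class never changes its position in the hierarchy.
        return self._parent


class Classification:
    """Hypothetical registry that puts each instance in exactly one class."""

    def __init__(self):
        self._class_of = {}  # instance identifier -> ClassNode

    def assign(self, instance_id, node):
        # An instance that already has a class cannot be given another one.
        if instance_id in self._class_of:
            current = self._class_of[instance_id].name
            raise ValueError(f"{instance_id!r} already belongs to {current}")
        self._class_of[instance_id] = node

    def class_of(self, instance_id):
        return self._class_of[instance_id]


# "objects" is an invented root class used only for this example.
objects = ClassNode("objects")
books = ClassNode("books", parent=objects)
harpsichords = ClassNode("harpsichords", parent=objects)

registry = Classification()
registry.assign("my copy of this book", books)

# A book never transforms into a harpsichord.
try:
    registry.assign("my copy of this book", harpsichords)
    reassigned = True
except ValueError:
    reassigned = False
```

Because `parent` has no setter and `assign` refuses a second assignment, the single-parent, one-class-per-instance, and immutability rules are enforced by construction rather than by convention.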

Classifications are always simple; the parental classes of any instance of the classification can be traced as a simple, non-branched list, ascending through the class hierarchy. As an example, here is the lineage for the domestic horse (Equus caballus), from the classification of living organisms:

Equus caballus
Equus subg. Equus
Equus
Equidae
Perissodactyla
Laurasiatheria
Eutheria
Theria
Mammalia
Amniota
Tetrapoda
Sarcopterygii
Euteleostomi
Teleostomi
Gnathostomata
Vertebrata
Craniata
Chordata
Deuterostomia
Coelomata
Bilateria
Eumetazoa
Metazoa
Fungi/Metazoa group
Eukaryota
cellular organisms

Taxonomists who view this lineage instantly grasp the place of domestic horses in the classification of all living organisms.
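Because each class has only one parent, a lineage like the one above can be recovered by following parent links until none remain. The following is a minimal sketch, assuming the classification is stored as a child-to-parent dictionary; the `PARENT` table and the `lineage` function are hypothetical names, and only the first few ranks of the horse lineage are included.

```python
# Child -> parent links for the first few ranks of the lineage listed above.
# The remaining ranks are left out to keep the example short.
PARENT = {
    "Equus caballus": "Equus subg. Equus",
    "Equus subg. Equus": "Equus",
    "Equus": "Equidae",
    "Equidae": "Perissodactyla",
    "Perissodactyla": "Laurasiatheria",
}


def lineage(name, parent_of):
    """Return the non-branched list of classes from `name` upward."""
    chain = [name]
    # Stop at the first class that has no recorded parent.
    while chain[-1] in parent_of:
        chain.append(parent_of[chain[-1]])
    return chain


horse_lineage = lineage("Equus caballus", PARENT)
print("\n".join(horse_lineage))
```

The loop never has to choose between alternatives, which is exactly what the single-parent rule guarantees.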

The rules for constructing classifications seem obvious and simplistic. Surprisingly, the task of building a logical and self-consistent classification is extremely difficult. Most classifications are rife with logical inconsistencies and paradoxes. Let's look at a few examples.

In 1975, while touring the Bethesda, Maryland campus of the National Institutes of Health, I was informed that Building 10 was the largest all-brick building in the world, providing a home to over 7 million bricks. Soon thereafter, an ambitious construction project was undertaken to greatly expand the size of Building 10. When the work was finished, Building 10 was no longer the largest all-brick building in the world. What happened? The builders used material other than brick, and Building 10 lost its classification as an all-brick building, violating the immutability rule of class assignments.

Apparent paradoxes that plague any formal conceptualization of classifications are not difficult to find. Let's look at a few more examples.

Consider the geometric class of ellipses: planar objects in which the sum of the distances to two focal points is constant. Class Circle is a child of Class Ellipse in which the two focal points of each instance occupy the same position, at the center, producing a radius of constant size. Imagine that Class Ellipse is provided with a class method called "stretch", in which the foci are moved further apart, thus producing flatter objects. When the parent class "stretch" method is applied to members of Class Circle, the circle stops being a circle and becomes an ordinary ellipse. Hence the inherited "stretch" method forces members of Class Circle to transition out of their assigned class, violating the rule that instances do not change their position in the classification.
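The circle and ellipse problem is easy to reproduce. In this sketch, an ellipse is described by the separation between its foci and by the constant sum of distances to them; the attribute names `focus_separation` and `distance_sum` and the helper `satisfies_circle_definition` are assumptions made for the illustration.

```python
class Ellipse:
    def __init__(self, focus_separation, distance_sum):
        # A valid ellipse needs the foci closer together than the distance sum.
        if not 0 <= focus_separation < distance_sum:
            raise ValueError("foci must be closer together than the distance sum")
        self.focus_separation = focus_separation
        self.distance_sum = distance_sum

    def stretch(self, amount):
        """Move the foci further apart, producing a flatter ellipse."""
        new_separation = self.focus_separation + amount
        if not new_separation < self.distance_sum:
            raise ValueError("cannot stretch beyond the distance sum")
        self.focus_separation = new_separation


class Circle(Ellipse):
    def __init__(self, radius):
        # A circle is an ellipse whose two foci coincide at the center.
        super().__init__(focus_separation=0.0, distance_sum=2.0 * radius)

    def satisfies_circle_definition(self):
        return self.focus_separation == 0


shape = Circle(radius=1.0)
before = shape.satisfies_circle_definition()
shape.stretch(0.5)  # inherited from Ellipse
after = shape.satisfies_circle_definition()
print(type(shape).__name__, before, after)  # Circle True False
```

After the call to `stretch`, the object is still labeled a `Circle` by the language, yet it no longer meets the defining property of its class.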

Let's look at the "Bag" class of objects. A "Bag" is a collection of objects, and a Class Bag is provided by many object-oriented programming languages and their class libraries. A "Set" is also a collection of objects (i.e., a subclass of Bag), with the special feature that duplicate instances are not permitted. For example, if Kansas is a member of the set of U.S. States, then you cannot add a second state named "Kansas" to the set. If Class Bag were to have an "increment" method that added "1" to the total count of objects in the bag whenever an object is added, then the "increment" method would be inherited by all of the subclasses of Class Bag, including Class Set. But Class Set cannot increase in size when duplicate items are added. Hence, inheritance creates a paradox in Class Set.
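The Bag and Set conflict can be shown with a small sketch. The classes below are hypothetical stand-ins written for this example (they are not the collection classes of any real library), and `add_grows_by_one` is an invented check that expresses the promise made by Class Bag.

```python
class Bag:
    def __init__(self):
        self._items = []
        self.count = 0

    def increment(self):
        # Adds 1 to the total count of objects in the bag.
        self.count += 1

    def add(self, item):
        self._items.append(item)
        self.increment()


class Set(Bag):
    def add(self, item):
        if item in self._items:
            return  # duplicates are rejected, so increment() never runs
        super().add(item)


def add_grows_by_one(collection, item):
    """Check the Bag promise: every add raises the count by exactly 1."""
    before = collection.count
    collection.add(item)
    return collection.count == before + 1


bag, states = Bag(), Set()
bag_results = [add_grows_by_one(bag, "Kansas") for _ in range(2)]
set_results = [add_grows_by_one(states, "Kansas") for _ in range(2)]
print(bag_results, set_results)  # [True, True] [True, False]
```

Code written against Class Bag expects the count to grow on every addition. Class Set inherits the method but cannot keep that promise when the second "Kansas" arrives.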

How does a data scientist deal with class objects that disappear from their assigned class and reappear elsewhere? In the examples discussed here, we saw the following:

  1. Building 10 at NIH was defined as the largest all-brick building in the world. Strictly speaking, Building 10 was a structure, and it had a certain weight and dimensions, and it was constructed of brick. "Brick" is an attribute or property of buildings, and properties cannot form the basis of a class of building, if they are not a constant feature shared by all members of the class (i.e., some buildings have bricks; others do not). Had we not conceptualized an "all-brick" class of building, we would have avoided any confusion.

  2. Class Circle qualified as a subclass of Class Ellipse, because a circle can be imagined as an ellipse whose two focal points happen to occupy the same location. Had we defined Class Ellipse to specify that class members must have two separate focal points, we could have excluded circles from Class Ellipse. Hence, we could have safely included the stretch method in Class Ellipse without creating a paradox.

  3. Class Set was made a subclass of Class Bag, but the increment method of Class Bag could not apply to Class Set. We created Class Set without taking into account the basic properties of Class Bag, which must apply to all its subclasses. Perhaps it would have been better if Class Set and Class Bag were created as children of Class Collection, each with its own set of properties.
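The remedy suggested in the third item can be sketched as follows. The shared parent holds only what is assumed here to be true of every collection (a size and a membership test), and each child defines its own rule for adding items. All class and method names are hypothetical.

```python
class Collection:
    """Parent class holding only properties shared by every collection."""

    def __init__(self):
        self._items = []

    def __len__(self):
        return len(self._items)

    def __contains__(self, item):
        return item in self._items


class Bag(Collection):
    def add(self, item):
        # Duplicates are allowed in a bag.
        self._items.append(item)


class Set(Collection):
    def add(self, item):
        # Duplicates are not permitted in a set.
        if item not in self._items:
            self._items.append(item)


bag, states = Bag(), Set()
for collection in (bag, states):
    collection.add("Kansas")
    collection.add("Kansas")
print(len(bag), len(states))  # 2 1
```

Since Set no longer descends from Bag, nothing written for Bag is inherited by Set, and neither class is asked to honor a property it cannot have.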

Worst Class Definition Ever

The worst definition of a Class may have been that given to the Kingdom of Protozoa, defined as the class of one-celled eukaryotic organisms. The problem here is that all of the classes of multicelled organisms (e.g., animals, plants and fungi) descended from classes of one-celled organisms. This means that Class Protozoa (defined as one-celled organisms) must exclude from its lineage all descendant classes that are multicellular. Hence, Kingdom Protozoa was given a definition that, paradoxically, excluded its own descendants. What were they thinking, back in the mid-19th century when Class Protozoa was conceived?
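A class definition of this kind can be tested mechanically: a defining property should hold for every descendant class. The sketch below uses a deliberately tiny toy tree, not a real taxonomy; the class names other than "Protozoa", the `IS_ONE_CELLED` table, and both functions are hypothetical.

```python
# Toy child -> parent links. This is an invented example, not real taxonomy.
PARENT = {
    "one-celled subclass": "Protozoa",
    "multicelled descendant": "one-celled subclass",
}

# Whether members of each toy class are one-celled.
IS_ONE_CELLED = {
    "Protozoa": True,
    "one-celled subclass": True,
    "multicelled descendant": False,
}


def descendants(name, parent_of):
    """Collect every class that descends from `name`."""
    found = []
    for child, parent in parent_of.items():
        if parent == name:
            found.append(child)
            found.extend(descendants(child, parent_of))
    return found


def violators(name, parent_of, has_property):
    """List descendants that lack the property used to define `name`."""
    return [d for d in descendants(name, parent_of) if not has_property[d]]


excluded = violators("Protozoa", PARENT, IS_ONE_CELLED)
print(excluded)  # ['multicelled descendant']
```

Any non-empty result means the defining property fails to extend to the subclasses, which is the flaw described above.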

- Jules Berman

key words: classification, ontology, taxonomy, paradoxes, precision medicine, jules j berman Ph.D., M.D.
