Saturday, March 26, 2016

Intro to Class Blending

I thought I'd devote the next few blogs to a concept that has gotten much less attention than it deserves: blended classes. Class blending lurks behind much of the irreproducibility in "Big Science" research, including clinical trials. It is also responsible for impeding progress in various scientific disciplines, particularly the natural sciences, where classification is of the utmost importance. We'll see that the scientific literature is rife with research of dubious quality, based on poorly designed classifications and blended classes.

For today, let's start with a definition and one example. We'll discuss many more specific examples in future blogs.

Blended class - Also known as class noise, a term that subsumes the more familiar but less precise term "labeling error." Blended class refers to inaccuracies (e.g., misleading results) introduced in the analysis of data due to errors in class assignments (i.e., assigning a data object to class A when the object should have been assigned to class B). If you are testing the effectiveness of an antibiotic on a class of people with bacterial pneumonia, the accuracy of your results will be compromised when your study population includes subjects with viral pneumonia or smoking-related lung damage. Errors induced by blending classes are often overlooked by data analysts who incorrectly assume that the experiment was designed to ensure that each data group is composed of a uniform and representative population.

Class blending commonly arises when the classification upon which the experiment is designed is itself blended. For example, imagine that you are a cancer researcher and you want to perform a study of patients with malignant fibrous histiocytomas (MFH), comparing the clinical course of these patients with the clinical course of patients who have other types of tumors. Let's imagine that the class of tumors known as MFH does not actually exist; that it is a grab-bag term erroneously assigned to a variety of other tumors that happened to look similar to one another. This being the case, it would be impossible to produce any valid results based on a study of patients diagnosed with MFH. The results would be a biased and irreproducible cacophony of data collected across different, and undetermined, species of tumors. This specific example, of the blended MFH class of tumors, is selected from the real-life annals of tumor biology [1], [2].
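
To make the antibiotic example concrete, here is a minimal Python sketch of how class blending can dilute an observed treatment effect. Every number in it is hypothetical and chosen only for illustration: an 80% recovery rate for treated bacterial pneumonia, 30% for untreated, and a flat 30% for misassigned subjects (say, viral pneumonia) who get no benefit from the antibiotic. The function names are likewise invented for this sketch.

```python
def blended_rate(rates, weights):
    """Expected recovery rate of a group that mixes several true classes."""
    if abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("weights must sum to 1")
    return sum(r * w for r, w in zip(rates, weights))


def observed_effect(frac_misassigned,
                    treated_true=0.80,   # hypothetical: bacterial, treated
                    control_true=0.30,   # hypothetical: bacterial, untreated
                    other_rate=0.30):    # hypothetical: misassigned subjects
    """Treated-minus-control recovery rate when a fraction of both
    study arms actually belongs to a class the drug does not help."""
    weights = [1.0 - frac_misassigned, frac_misassigned]
    treated = blended_rate([treated_true, other_rate], weights)
    control = blended_rate([control_true, other_rate], weights)
    return treated - control


if __name__ == "__main__":
    for f in (0.0, 0.2, 0.5):
        print(f"misassigned fraction {f:.1f}: "
              f"observed effect {observed_effect(f):.2f}")
```

Under these assumed rates, the apparent benefit shrinks in direct proportion to the fraction of misassigned subjects. Real blending is usually worse than this sketch suggests, because the fraction and the identity of the misassigned subjects are unknown to the analyst.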


[1] Al-Agha OM, Igbokwe AA. Malignant fibrous histiocytoma: between the past and the present. Arch Pathol Lab Med 132:1030-1035, 2008.

[2] Nakayama R, Nemoto T, Takahashi H, Ohta T, Kawai A, Seki K, et al. Gene expression analysis of soft tissue sarcomas: characterization and reclassification of malignant fibrous histiocytoma. Modern Pathology 20:749-759, 2007.

- Jules Berman (copyrighted material)

key words: data science, irreproducible results, complexity, classification, ontology, ontologies, jules j berman
