Saturday, August 4, 2018

Second Edition of Principles and Practice of Big Data now on ScienceDirect

The second edition of my book, Principles and Practice of Big Data, has just been released and is available for purchase at many sites, including Amazon.

For those of you fortunate enough to have access to ScienceDirect, you can download chapters of my book at:


  Author's Preface to Second Edition 

  Author's Preface to First Edition 

  Chapter 1. Introduction
    Section 1.  Definition of Big Data
    Section 2.  Big Data Versus small data
    Section 3.  Whence Comest Big Data?
    Section 4.  The Most Common Purpose of Big Data is to Produce small data
    Section 5.  Big Data Sits at the Center of the Research Universe
    Section 6.  Case Study: From the Press: Big Claims for Big Data

  Chapter 2. Providing Structure to Unstructured Data
    Section 1.  Nearly all Data is Unstructured and Unusable in its Raw Form
    Section 2.  Term Extraction
    Section 3.  Autocoding
    Section 4.  Concordances
    Section 5.  Indexing
    Section 6.  Machine Translation
    Section 7.  Case Study: Sorted Lists (Why and Why Not)
    Section 8.  Case Study: Doublet Lists 
    Section 9.  Case Study: Ngram Lists 
    Section 10.  Case Study: Proximity Searches Using Only a Concordance  
    Section 11.  Case Study (Advanced): Burrows Wheeler Transform (BWT) 

  Chapter 3. Identification, Deidentification, and Reidentification
    Section 1.  What are Identifiers?
    Section 2.  Difference Between an Identifier and an Identifier System
    Section 3.  Generating Identifiers
    Section 4.  Really Bad Identifier Methods
    Section 5.  Registered Unique Object Identifiers
    Section 6.  Deidentification
    Section 7.  Reidentification
    Section 8.  Case Study: Data Scrubbing
    Section 9.  Case Study: Identifiers in Image Headers
    Section 10.  Case Study: Hospital Registration
    Section 11.  Case Study: One-Way Hashes

  Chapter 4. Metadata, Semantics, and Triples
    Section 1.  Metadata
    Section 2.  eXtensible Markup Language
    Section 3.  Namespaces
    Section 4.  Semantics and Triples
    Section 5.  Case Study: Syntax for Triples 
    Section 6.  Case Study: RDF Schema
    Section 7.  Case Study: RDF Parsers and the Fungibility of Triples
    Section 8.  Case Study: Dublin Core 

  Chapter 5. Classifications and Ontologies
    Section 1.  It's All About Object Relationships 
    Section 2.  The Difference Between Object Relationships and Object Similarities
    Section 3.  Classifications, the Simplest of Ontologies
    Section 4.  Ontologies, Classes with Multiple Parents
    Section 5.  Choosing a Class Model
    Section 6.  Paradoxes
    Section 7.  Class Blending
    Section 8.  Common Pitfalls in Ontology Development
    Section 9.  Case Study: An Upper Level Ontology 
    Section 10.  Case Study: Visualizing Class Relationships 
    Section 11.  Case Study: Bringing Order from Chaos with the Classification of Living Organisms

  Chapter 6. Introspection
    Section 1.  Knowledge of Self
    Section 2.  Data Objects
    Section 3.  How Big Data Uses Introspection 
    Section 4.  Case Study: Timestamping Data 
    Section 5.  Case Study: A Visit to the TripleStore 

  Chapter 7. Data Integration and Software Interoperability
    Section 1.  Another Big Problem for Big Data
    Section 2.  The Standard for Standards
    Section 3.  Standard Trajectories
    Section 4.  Specifications and Standards
    Section 5.  Versioning
    Section 6.  Compliance Issues
    Section 7.  Interfaces to Big Data Resources
    Section 8.  Case Study: Standardizing the Chocolate Teapot

  Chapter 8. Immutability and Immortality
    Section 1.  The Importance of Data that Cannot Change  
    Section 2.  Immutability and Identifiers
    Section 3.  Persistent Data Objects
    Section 4.  Coping with the Data that Data Creates
    Section 5.  Reconciling Identifiers Across Institutions
    Section 6.  Case Study: The Trusted Timestamp
    Section 7.  Case Study: Blockchains and Distributed Ledgers
    Section 8.  Case Study: Zero-Knowledge Reconciliation   

  Chapter 9. Assessing the Adequacy of a Big Data Resource
    Section 1.  Looking at the Data 
    Section 2.  The Minimal Necessary Properties of Big Data 
    Section 3.  Case Study: Utilities for Viewing and Manipulating Very Large Files
    Section 4.  Case Study: Flattened Data 
    Section 5.  Case Study: Data that Comes with Conditions 

  Chapter 10. Measurement
    Section 1.  Accuracy and Precision
    Section 2.  Data Range
    Section 3.  Counting
    Section 4.  Normalizing, and Transforming Your Data
    Section 5.  Reducing Your Data
    Section 6.  Understanding Your Control
    Section 7.  Practical Significance of Measurements
    Section 8.  Case Study: Gene Counting
    Section 9.  Case Study: The Significance of Narrow Data Ranges
    Section 10.  Case Study (Advanced): Fast Fourier Transform
    Section 11.  Case Study (Advanced): Principal Component Analysis

  Chapter 11. Indispensable Tips for Fast and Simple Big Data Analysis
    Section 1.  Speed and Scalability
    Section 2.  Fast Operations, Suitable for Big Data, that Every Computer Supports
    Section 3.  Fast Correlation Methods
    Section 4.  Clustering 
    Section 5.  Methods for Data Persistence (Without Using a Database)
    Section 6.  Back_of_Envelope Computations for Big Data
    Section 7.  Fast Data Retrieval for Lists of any Size 
    Section 8.  Case Study: One-Pass Mean and Standard Deviation
    Section 9.  Case Study: Climbing a Classification
    Section 10.  Pre-computing lookup lists: Google's PageRank
    Section 11.  Case Study: A Database Example 
    Section 12.  NoSQL and other Non-Relational Big Data Databases

  Chapter 12. Finding the Clues in Large Collections of Data
    Section 1.  Denominators 
    Section 2.  Frequency Distributions
    Section 3.  Multimodality
    Section 4.  Outliers and Anomalies
    Section 5.  Case Study: Discarding the Noisiest Frequencies in a Data Signal
    Section 6.  Case Study: Predicting User Preferences
    Section 7.  Case Study: Multimodality in Legacy Data
    Section 8.  Case Study: Big and Small Black Holes

  Chapter 13. Using Random Numbers to Knock Your Big Data Analytic Problems Down to Size
    Section 1.  The Remarkable Utility of (Pseudo)Random Numbers 
    Section 2.  Resampling and Permutating 
    Section 3.  Case Study: Sample Size and Power Estimates
    Section 4.  Monte Carlo Simulations
    Section 5.  Case Study: Monty Hall Problem: Solving What We Cannot Grasp
    Section 6.  Case Study: Frequency of Unlikely String of Occurrences 
    Section 7.  Case Study: The Infamous Birthday Problem
    Section 8.  Case Study: A Bayesian Analysis of Insurance Costs 

  Chapter 14. Special Considerations in Big Data Analysis
    Section 1.  Theory in Search of Data 
    Section 2.  Data in Search of Theory
    Section 3.  Overfitting
    Section 4.  Bigness Bias
    Section 5.  Too Much Data
    Section 6.  Fixing Data
    Section 7.  Data Subsets in Big Data: Neither Additive nor Transitive
    Section 8.  Additional Big Data Pitfalls
    Section 9.  Case Study: Curse of Dimensionality

  Chapter 15. Big Data Failures and How to Avoid (Some of) Them
    Section 1.  Failure is Common
    Section 2.  Failed Standards
    Section 3.  Blaming Complexity
    Section 4.  Perils of Redundancy
    Section 5.  Save Time and Money; Don’t Protect Data that Does not Need Protection
    Section 6.  An Approach to Big Data that May Work For You
    Section 7.  After Failure
    Section 8.  Case Study: Cancer Biomedical Informatics Grid, a Bridge too Far
    Section 9.  Case Study: The Gaussian Copula Function

  Chapter 16. Legalities
    Section 1.  Responsibility for the Accuracy and Legitimacy of Data
    Section 2.  Rights to Create, Use, and Share the Resource
    Section 3.  Copyright and Patent Infringements Incurred by Using Standards
    Section 4.  Protections for Individuals
    Section 5.  Consent
    Section 6.  Unconsented Data
    Section 7.  Good Policies are a Good Policy
    Section 8.  Case Study: The "Inconclusive" Data Analysis
    Section 9.  Case Study: The Havasupai Story
    Section 10.  Case Study: Double-edged Sword of the U.S. Data Quality Act 

  Chapter 17. Data Sharing 
    Section 1.  What Is Data Sharing, and Why Don't We Do More of It?
    Section 2.  Common Complaints
    Section 3.  Case Study: Life on Mars
    Section 4.  Case Study: Who Shares Their Data 
    Section 5.  Case Study: National Patient Identifier

  Chapter 18. Data Reanalysis: Much More Important than Analysis
    Section 1.  First Analysis (Nearly) Always Wrong 
    Section 2.  Why Reanalysis is More Important than Analysis
    Section 3.  Case Study: Reanalysis of Old JADE Collider Data 
    Section 4.  Case Study: Vindication Through Reanalysis 
    Section 5.  Case Study: Finding New Planets from Old Data 

  Chapter 19. Repurposing Big Data
    Section 1.  What is Data Repurposing? 
    Section 2.  Dark Data, Abandoned Data, and Legacy Data 
    Section 3.  Case Study: From Postal Code to Demographic Keystone 
    Section 4.  Case Study: Fingerprints and Data-driven Forensics
    Section 5.  Scientific Inferencing from a Database of Genetic Sequences
    Section 6.  Case Study: Linking global warming to high-intensity hurricanes
    Section 7.  Case Study: Inferring climate trends with geologic data
    Section 8.  Case Study: Old tidal data, and the iceberg that sank the Titanic
    Section 9.  Case Study: Lunar Orbiter Image Recovery Project
    Section 10.  Case Study: The Cornucopia of the Natural Sciences

  Chapter 20. Societal Issues
    Section 1.  How Big Data Is Perceived by the Public
    Section 2.  Reducing Costs and Increasing Productivity with Big Data
    Section 3.  Public Mistrust
    Section 4.  Saving Us from Ourselves 
    Section 5.  Who is Big Data?
    Section 6.  Hubris and Hyperbole
    Section 7.  Case Study: The Citizen Scientists
    Section 8.  Case Study: 1984, by George Orwell


- Jules Berman

Wednesday, May 9, 2018

Read Precision Medicine and the Reinvention of Human Disease on ScienceDirect

It is regrettable that many of my textbooks are unaffordable to the majority of the potential market. For example, Precision Medicine and the Reinvention of Human Disease sells on Amazon for $125. This book contains nearly a quarter-million words, and it must have cost the publisher a lot of money to print and distribute, but I certainly wish it could have been sold at a lower price.

As a remedy, for some of you, this book is being marketed by Elsevier (the owner of the Academic Press imprint under which it was published) through ScienceDirect, a subscription online book catalog bought by university libraries. This means that if you have online access to a university library that has paid for a ScienceDirect subscription, you may have free access to my book.

Precision Medicine and the Reinvention of Human Disease was published January 30, 2018, and it is possible that your university library may have a ScienceDirect subscription that does not yet access my book. After speaking today with my editor, it's my impression that ScienceDirect access for libraries is something akin to cable channel access for homes. You can add access to specific books or you can add access to bundles of books that cover areas of interest. If you have access to ScienceDirect, but your university doesn't yet have access to my book(s), please talk to your librarian and ask if he/she will add my Elsevier publications to their ScienceDirect subscription.

There is an excellent preview of Precision Medicine and the Reinvention of Human Disease at the Google Books site.

- Jules Berman

key words: precision medicine, ScienceDirect, library acquisitions, book subscriptions, jules j berman Ph.D. M.D.

Thursday, February 15, 2018

Inscrutable Genes

  • "In most cases, the molecular consequences of disease, or trait-associated variants for human physiology, are not understood." from: Manolio TA, Collins FS, Cox NJ, Goldstein DB, Hindorff LA, Hunter DJ, et al. Finding the missing heritability of complex diseases. Nature 2009;461:747–53.

The 1960s was a wonderful decade for the field of molecular genetics. Hundreds of inherited metabolic diseases were being studied. Most of these diseases could be characterized by a simple inherited mutation in a disease-causing gene. Back then, we thought we understood genetic diseases. Here’s how it all might have worked, if life were simple: one mutation → one gene → one protein → one disease. This lovely genetic parable, from a bygone generation, seldom applies in the era of Precision Medicine. The purpose of this section is to explain some of the complexities of modern genetics and to lay out the job of the Precision Medicine scientist who must dissect the pathways that lead from gene to disease.

In Precision Medicine and the Reinvention of Human Disease, two of the most confounding aspects of modern disease genetics are discussed: that a single disease may result from one of many distinct molecular defects; and that a single gene may produce many different diseases. These two countervailing phenomena tell us something very important about disease development. The first tells us that different pathways may converge on the same disease; the second, that any single gene may perturb a biological system (i.e., a living organism) in different ways. Some of that discussion is excerpted here.

There are numerous examples wherein mutations in one gene may result in more than one disease [2]. In some cases, each of the diseases caused by the altered gene is fundamentally similar (e.g., spherocytosis and elliptocytosis, caused by mutations in the alpha-spectrin gene; Usher syndrome type IIIA and retinitis pigmentosa-61, caused by mutations in the CLRN1 gene). In other cases, diseases caused by the same gene may have no obvious relation to one another. For example, the APOE gene encodes apolipoprotein E, which is involved in the synthesis of lipoproteins. One common allele of the APOE locus, e4, increases the risk of Alzheimer disease and of heart disease, two disorders with no obvious clinical similarity [3,4].

Let’s look at a few other examples where mutations in a single gene play causal roles in the development of diverse diseases. For example, different mutations of the same gene, desmoplakin, cause the following diseases [2]:

  • Arrhythmogenic right ventricular dysplasia 8

  • Dilated cardiomyopathy with woolly hair and keratoderma

  • Lethal acantholytic epidermolysis bullosa

  • Keratosis palmoplantaris striata II

  • Skin fragility-woolly hair syndrome

How is it possible that errors in the gene coding for desmoplakin, a constituent protein found in intercellular junctions, could account for such apparently unrelated diseases as arrhythmogenic right ventricular dysplasia and lethal acantholytic epidermolysis bullosa? It happens that we know that specialized desmosomes in cardiac cells (i.e., intercalated discs) tightly couple myocytes so that they can function as a coordinated group. Desmosomes are also required for the adhesion of skin epidermal cells to one another and to the underlying basement membrane. In the case of desmoplakin mutations, it is relatively easy to see the pathogenetic relationship among these diseases.

In other sets of diseases that result from an error in one specific gene, the pathogenetic relationship may not be so easily discerned. Some cases of Charcot-Marie-Tooth axonal neuropathy, lipodystrophy, Emery-Dreifuss muscular dystrophy, and premature aging syndromes are all caused by mutations in the LMNA (Lamin A/C) gene. Stickler syndrome type III, Fibrochondrogenesis-2, and a form of nonsyndromic hearing loss are all caused by mutations in the COL11A2 gene. In these cases, how can variations in a single gene cause many different diseases?

Let’s look at just a few of the possibilities:

  • One gene can control the synthesis of more than one protein [6].

  • A single protein may have multiple functions. For example, the nuclear lamina protein lamin A/C has several biological roles: controlling nuclear shape, influencing transcription, and organizing heterochromatin. Mutations in the LMNA gene cause more than 10 different clinical syndromes, including neuromuscular and cardiac disorders, premature aging disorders, and lipodystrophy. Likewise, the polyfunctional TP53 gene has been linked to 11 clinically distinguishable cancer-related disorders [7].

  • A single protein with a single function may have different biological effects based on the cell type in which the protein is expressed, the stage of development in which the protein is expressed, and the cellular milieu (e.g., concentrations of substrate or protein inhibitors) for a given cell type, at a particular moment in time.

  • Diseases develop through a sequence of biological events occurring over time. A mutation may exert a different biological effect based on where and when, in the sequence of pathogenetic events, it is expressed.

more to follow

- Jules Berman

key words: precision medicine, genetics, multi-step, pathogenesis, genetic heterogeneity, jules j berman Ph.D. M.D.

Wednesday, February 14, 2018

Infections Develop Via a Sequence of Biological Steps

A prior post listed 7 assertions regarding the role of infectious organisms on the human genome. In the next few blogs we'll look at each assertion, in excerpts from Precision Medicine and the Reinvention of Human Disease. Here's the seventh:

By dissecting the biological steps involved in the pathogenesis of infectious disease, it is possible to develop new treatments, other than antibiotics, that will be effective against a range of related organisms.

Nature has evolved a variety of protective mechanisms that interfere with the different steps in the development of infectious diseases. For example, to defend against malaria, nature has preserved various mutations that render red cells unsuitable hosts for malarial guests. The hemoglobin variants HbS (sickle cell trait), HbC, and HbE increase the likelihood that an infected red cell will lyse. Likewise, but for obscure reasons, regulatory defects in hemoglobin synthesis, as seen in thalassemia, may also confer some protection against malaria. In addition, variations in SLC4A1, a structural protein of erythrocytes, that cause ovalocytosis, and polymorphisms of the glucose-6-phosphate dehydrogenase gene [57], both seem to protect against malaria.

We see individuals resistant to malaria due to the absence of the Duffy protein required for Plasmodium vivax to bind and enter erythrocytes [58]. Knowing this, the Duffy-binding protein in the malaria parasite is now being studied as a potential drug or vaccine target in a new strategy against malaria [58]. More generally, drugs known as entry inhibitors are being developed based on knowledge that the attachment and entry of organisms may depend upon specific cooperative pathways, in host and invader cells, that can be targeted by drugs. We know that there are many steps in the infection process that could be blocked by small changes in proteins that are unrelated to the immune process. For example, for an infectious agent to invade and flourish in an organism, it must gain entry into the tissues of the body, evading physical and chemical defenses along its way. It must find a place in which it can receive nourishment appropriate to its species and avoid any toxins that may be produced by its host. It must be able to grow as a collection of organisms, and this typically means that the host must permit some degree of invasion through its own tissues. These are just a few of the nonimmunological hurdles that invasive organisms must jump over, if they are to infect an organism. Every step in the pathogenesis of infectious disease provides another therapeutic opportunity. As we learn more about the pathways of development of infectious diseases that have become increasingly resistant to antibiotics, we will come to rely on Precision Medicine to prevent, diagnose, and treat infections.

- Jules Berman

key words: precision medicine, infectious disease, biological steps, pathogenesis, jules j berman Ph.D., M.D.

Tuesday, February 13, 2018

Non-immunologic Causes of Increased Susceptibility to Disease

A prior post listed 7 assertions regarding the role of infectious organisms on the human genome. In the next few blogs we'll look at each assertion, in excerpts from Precision Medicine and the Reinvention of Human Disease. Here's the sixth:

Cellular defects that have no direct connection to immunity may increase susceptibility to infectious organisms.

If we want to understand why certain individuals are susceptible to infections and other individuals are not, we must understand that immune deficiencies cannot account for all infections. Infectious diseases, just like any other disease, develop in steps, and it stands to reason that there must be many different pathways through which those steps can be enhanced or blocked. Theory aside, what is the actual evidence that susceptibility to infectious diseases arises through deficiencies unrelated to the immune system?

  • Time and again, we encounter serious infections from organisms thought to be nonpathogenic, occurring in immunocompetent individuals [48–51].

  • Not everyone with an immune deficit will succumb to an infectious disease, implying that these individuals are protected by resistance mechanisms other than immunity.

  • We know of various genetic conditions that increase our susceptibility to infectious diseases, and some of these genetic flaws have nothing to do with the adaptive (i.e., antibody-forming) immune system. For example, children with sickle cell disease or congenital asplenia will have a heightened susceptibility to invasive pneumococcal diseases [52]. Otherwise-normal children with IRAK4 or NEMO gene mutations will also have a high risk of invasive pneumococcal disease [52]. The IRAK4 and NEMO genes code for proteins involved in the phagocytosis of bacteria by splenic macrophages. Likewise, in mice, natural resistance to infection is influenced by the Bcg gene, which affects the early phagocytosis and destruction of intracellular organisms by macrophages [53]. As a final example, both humans and zebrafish that have mutations that reduce the synthesis of a proinflammatory leukotriene have heightened susceptibility to Mycobacterium tuberculosis [54]. It is easy to find examples of nonimmunologic mechanisms for susceptibility to infections [55,56].

- Jules Berman

key words: precision medicine, immune system, susceptibility to disease, non-immunologic, jules j berman Ph.D., M.D.

Monday, February 12, 2018

Infection without Disease (from Precision Medicine and the Reinvention of Human Disease)

A prior post listed 7 assertions regarding the role of infectious organisms on the human genome. In the next few blogs we'll look at each assertion, in excerpts from Precision Medicine and the Reinvention of Human Disease. Here's the fifth:

Normal defenses can block every infectious disease. Hence, every infectious disease results from a failure of our normal defenses, immunologic and otherwise.

For any given infectious agent, no matter how virulent it may seem, there are always individuals who can resist infection. Moreover, as a generalization, the majority of individuals who are infected with a pathogenic microorganism will never develop any clinically significant disease [42].

As one example, Naegleria fowleri is often found in warm freshwater. Swimmers in contaminated waters may develop an infection that spreads from the nasal sinuses to the central nervous system, to produce an encephalitis that is fatal in 97% of cases [43]. Despite the hazard posed by Naegleria, health authorities do not generally test freshwater sources to determine the presence of the organism. Do not expect to find warning signs posted at swimming holes announcing that the water is contaminated by an organism that produces a disease that has a nearly 100% fatality rate. It is simply assumed that anyone who spends any time around freshwater will eventually be exposed to Naegleria. As it happens, although many thousands of individuals are exposed each year to Naegleria in the United States, only a few cases of Naegleria encephalitis occur in this country. In fact, since Naegleria was recognized as a cause of encephalitis, in 1965, fewer than 150 cases have been reported [44]. Most of the reported cases have occurred in children and adolescents and are associated with recreational water activities [45,46]. The children who develop Naeglerian encephalitis, though exhibiting no signs of immune deficiency, are nonetheless susceptible to Naegleria. What makes these children different from all the other children and adults who were exposed to the same organisms?

Neisseria meningitidis, a cause of bacterial meningitis, can be cultured from nasal swabs sampled from the general population. If N. meningitidis were a primary pathogen, then why doesn’t it cause disease in the vast majority of infected individuals? If N. meningitidis were an opportunistic infection, then why does it typically cause disease in healthy college-age individuals (not immunocompromised individuals)? If this organism is neither a primary pathogen nor an opportunistic pathogen, then what kind of pathogen is it? More importantly, why is N. meningitidis a potentially fatal pathogen in some individuals and a harmless commensal in others [47]?

Organisms that were formerly thought to be purely pathogenic are now known to frequently live quietly within infected humans, without causing symptoms of disease. For example, parasites such as the agents that cause Chagas disease, leishmaniases, and toxoplasmosis are commonly found living in apparently normal individuals. Viruses, including the agents that cause herpes simplex infections and infections by hepatitis viruses B and C, can be found in healthy individuals. Mycobacterium tuberculosis can infect an individual, produce a limited pathologic reaction in the lung, and remain in the body in a quiescent state for the life of the individual. In fact, it has been estimated that about one out of three individuals, worldwide, is infected with Mycobacterium tuberculosis, and will never suffer any consequences. Luckily, asymptomatic carriers of tuberculosis, in whom there is no active pulmonary disease, are noninfective. Staphylococcus aureus, a bacterial pathogen that is known to produce abscesses, invade through tissues, and release toxins, is also known to circulate in the blood, without causing symptoms, in a sizeable portion of the human population [40].

We now know that potentially virulent organisms are normally tamed within our bodies. Hence, the root cause of every clinical infection is a deficiency in the defenses of particular subpopulations of individuals.

- Jules Berman

key words: precision medicine, commensals, symbiotes, symbiotic, host organisms, latent infection, jules j berman Ph.D. M.D.

Sunday, February 11, 2018

Cellwise, We Are Mostly Inhuman

A prior post listed 7 assertions regarding the role of infectious organisms on the human genome. In the next few blogs we'll look at each assertion, in excerpts from Precision Medicine and the Reinvention of Human Disease. Here's the fourth:

Most of the cells residing in human bodies are nonhuman.

There are about 10 times as many nonhuman cells living in our bodies as there are human cells [40]. The human intestines alone contain 40,000 different species of bacteria [9]. These 40,000 species contain about 9 million different genes. Compare that with the paltry 23,000 genes in the human genome, and we quickly see that we Homo sapiens contribute very little to the genetic diversity of the human body’s ecosystem.

- Jules Berman

key words: precision medicine, commensals, symbiotes, symbiotic, host organisms, jules j berman Ph.D. M.D.

Saturday, February 10, 2018

Genome-Specific Responses to Infection

A prior post listed 7 assertions regarding the role of infectious organisms on the human genome. In the next few blogs we'll look at each assertion, in excerpts from Precision Medicine and the Reinvention of Human Disease. Here's the third:

A good portion of the genes in humans (perhaps 10%) are involved in responses to infectious organisms.

It has been estimated that over 1000 human genes are involved in inflammation pathways [37]. Several studies have shown that following an inflammatory challenge or the introduction of a pathogen, more than a hundred genes are activated [38–40]. The activated genes include some of the same genes that have been associated with autoimmune diseases, suggesting that these disease-associated genes are conserved because they have a beneficial role, protecting us from invading pathogens [39]. The genetic profile of genes activated by inflammation is very similar from human to human, but quite dissimilar from the profile of genes activated by inflammation in the mouse [41]. This would suggest that species develop their own genome-wide responses to agents that cause inflammation (e.g., invading organisms).

- Jules Berman

key words: precision medicine, evolution, virus, viral, jules j berman Ph.D. M.D.

Friday, February 9, 2018

Vertebrate Evolution Driven by DNA from Infectious Organisms

A prior post listed 7 assertions regarding the role of infectious organisms on the human genome. In the next few blogs we'll look at each assertion, in excerpts from Precision Medicine and the Reinvention of Human Disease. Here's the second:

Some of the key steps in the development of vertebrate animals, and mammals in particular, have come from DNA acquired from infectious organisms.

The human genome has preserved its viral ballast, at some cost. At every cell division, energy is expended to replicate the genome, and the larger the genome, the more energy must be expended. Why do we spend a large portion of the energy required to replicate our genome, on inactive sequences, of viral origin? Why doesn’t our genome simply eject the extra DNA, a biological process that is commonplace in the evolution of obligate intracellular parasitic organisms? Maybe it's because we use viral genes to our own advantage.

Two evolutionary leaps, benefiting the ancestral classes of humans, and owed to the acquisition of viral genes, include the attainment of adaptive immunity and the development of the mammalian placenta. Let’s take a moment to see how these innovations came about.

Adaptive immunity evolved at about the same time that jawed vertebrates first appeared on earth. The crucial gene responsible for the great leap to adaptive immunity, the recombination activating gene (RAG), was stolen from a retrovirus. To understand the pivotal evolutionary role of RAG, we need to review a bit of high school biology. The adaptive immune system responds to the specific chemical properties of foreign antigens, such as those that appear on viruses and other infectious agents. Adaptive immunity is a system wherein somatic T cells and B cells are produced, each with a unique and characteristic immunoglobulin (in the case of B cells) or T-cell receptor (in the case of T cells). Through a complex presentation and selection system, a foreign antigen elicits the replication of a B cell whose unique immunoglobulin molecule (i.e., its antibody) matches the antigen. Secretion of matching antibodies leads to the production of antigen-antibody complexes that may deactivate and clear circulating antigens, or may lead to the destruction of the organism that carries the antigen (e.g., a virus or bacterium).

To produce the many unique B and T cells, each with a uniquely rearranged segment of DNA that encodes specific immunoglobulins or T-cell receptors, recombination and hypermutation must take place within a specific gene region. This process yields on the order of a billion unique somatic genes, and requires the participation of recombination activating genes (RAGs). The acquisition of a recombination activating gene is presumed to be the key evolutionary event that led to the development of the adaptive immune system present in all jawed vertebrates (gnathostomes). Before the appearance of the jawed vertebrates, this sort of recombination was genetically unavailable to animals. Our genes simply were not equal to the task. Retroviruses, however, are specialists at cutting, moving, and mutating DNA. Is it any wonder that the startling evolutionary leap to adaptive immunity was acquired from retrotransposons? Thus, we owe our most important defense against infections to genetic material retrieved from the vast trove of retrovirally derived DNA carried in our genome [33]. As one might expect, inherited mutations in RAG genes are the root causes of several immune deficiency syndromes [34,35].

Many millions of years later, vertebrates acquired another gene that did much to enable the evolution of all mammals. Members of Class Mammalia are distinguished by the development of the placenta, an organ that grows within the uterine cavity (i.e., the endometrium). After birth, the placenta must detach from the uterus. You can imagine the delicate balancing act between attaching firmly to the wall of the uterus and detaching cleanly from the wall of the uterus. During placental development, large, flat cells called cytotrophoblasts form the interface between placenta and uterus. To create the thin membrane that borders the lining of the uterus and that borders the blood received from the uterus in the spaces between the placental villi, the cytotrophoblasts must somehow fuse into a syncytium (i.e., multinucleate collections of cells that have fused together by dissolving their individual cytoplasmic membranes).

There is one task at which all animals excel: maintaining a clear separation between one cell and another. In point of fact, the most distinctive difference between animal cells and all other cells of eukaryotic origin happens to be the presence of cell junctions, whose purpose is to bind cells to one another without fusing cells. This being the case, you can see that the normal direction of animal evolution would preclude the appearance of a gene intended to form a huge syncytium of placental cells. Whereas animal cells are failures at fusion, viruses are champions. One of the most often-deployed methods by which viruses invade cells is through fusion at the cytoplasmic membrane. It happens that retroviral envelope genes, preserved in the human genome, do a very good job at fusing membranes. Animals captured a retroviral fusogenic envelope gene and inserted it into one of the first syncytin molecules involved in the development of the placenta. Apparently, this acquisition worked out so well for mammals that later-evolving mammalian classes made their own retrovirus gene acquisitions to obtain additional syncytins, thus refining the placenta for their own subclasses [23,36].

- Jules Berman

key words: precision medicine, evolution, virus, viral, jules j berman Ph.D. M.D.

Thursday, February 8, 2018


Yesterday's post listed 7 assertions regarding the role of infectious organisms on the human genome. In the next few blogs we'll look at each assertion, in excerpts from Precision Medicine and the Reinvention of Human Disease. Here's the first:

A significant portion of the human genome consists of relic DNA derived from ancient invasive organisms.

About 8% of our genome is derived from sequences with similarity to known infectious retroviruses, and these longer sequences can usually be recognized by their contained subsequences (e.g., gag, pol, and env genes) and long terminal repeats. The viral sequences in our genomes are the remnants of ancient retroviral infections, and the occasional nonretroviral infection, that were branded into DNA, and subsequently amplified [21–23]. Because much of the endogenous retroviral load in the human genome is due to amplification, and subsequent mutation, it is hard to determine the number of retroviral species that established their niche in the human gene pool, but studies of these viral remains would suggest that we contain species from several dozen families of retroviruses, with an undetermined number of contributions from individual family members [24]. Based on comparisons of the viruses present in different species of primates, it would appear that the most recent acquisition of an endogenous retrovirus occurred in humans between 100,000 and 1 million years ago [25]. Most of the retroviral sequences in our genomes are inactivated due to an accumulation of degenerative mutations collected over the eons, indicating that there has been little or no selective pressure to conserve retroviruses in their pristine sequences.

- Jules Berman

key words: precision medicine, human genome, evolution, infectious diseases, jules j berman, Ph.D., M.D.

Wednesday, February 7, 2018

Infections have made their mark on the Human Genome

In the context of Precision Medicine, infections draw our attention because they have played an important role in the evolution of the eukaryotic genome. Over the next few blog posts, we will explore the following:

  • A significant portion of the human genome consists of relic DNA derived from ancient invasive organisms.
  • Some of the key steps in the development of vertebrate animals, and mammals in particular, have come from DNA acquired from infectious organisms.
  • A good portion of the genes in humans (perhaps 10%) are involved in responses to infectious organisms.
  • Most of the cells in the human (at least 90%) consist of infectious organisms and commensals that have adapted to life within human hosts. [Glossary Commensal]
  • Normal defenses can block every infectious disease. Hence, every infectious disease results from a failure of our normal defenses, immunologic and otherwise.
  • Cellular defects that have no direct connection to immunity may increase susceptibility to infectious organisms.
  • By dissecting the biological steps involved in the pathogenesis of infectious disease, it is possible to develop new treatments, other than antibiotics, that will be effective against a range of related organisms.
Over the next few blogs, we'll do our best to justify each of these (as yet) unproven assertions.

- Jules Berman

key words: precision medicine, infections, evolution, resistance to infection, jules j berman Ph.D., M.D.

Tuesday, February 6, 2018

Precision Medicine and Public Health (from Precision Medicine and the Reinvention of Human Disease)

Excerpted from Precision Medicine and the Reinvention of Human Disease

Despite having the most advanced healthcare technology on the planet, life expectancy in the United States is not particularly high. Citizens from most of the European countries and the highly industrialized Asian countries enjoy longer life expectancies than the United States. According to the World Health Organization, the United States ranks 31st among nations, trailing behind Greece, Chile, and Costa Rica, and barely edging out Cuba [42]. Similar rankings are reported by the US Central Intelligence Agency [43]. These findings lead us to infer that access to advanced technologies, such as those offered by Precision Medicine, will not extend lifespan significantly.

Every healthcare professional knows that most of the deaths occurring in this country can be attributed to personal lifestyle choices: smoking, drinking, drug abuse, and over-eating. Lifestyle diseases account for the majority of deaths in the United States and in other western countries, these being: heart disease, diabetes, obesity, and cancer. Population-based trials that seek to improve the ways in which individuals live, by introducing a daily exercise routine, healthy diet, and cigarette abstinence, have yielded huge benefits, in terms of extending average lifespans [44]. At the front end of the human life cycle, it has been demonstrated that infant mortalities can be markedly reduced with simple measures, focusing on improved maternal education [45]. It has been credibly argued that clean water, clean air, clean housing, clean food, and clean living yield greater societal benefits than clean operating rooms [46,47]. If this be the case, should we be investing heavily in Precision Medicine, when simple, low-tech public health measures are likely to provide a greater return on investment, in terms of overall morbidity and mortality? In a certain sense, public health is the opposite of personalized medicine. Whereas personalized medicine involves finding the best possible treatment for individuals, based on their uniqueness, public health involves finding ways of treating whole populations based on their collective sameness. Let’s not dwell on these somewhat contrived philosophic points. Precision Medicine, as viewed in this book, is a new way of understanding human diseases. As such, Precision Medicine provides opportunities to advance both personalized medicine and public health.

Precision Medicine tells us that we should think of each disease as a developmental process, with each step in the process representing an opportunity for intervention. Perhaps the most important function of Precision Medicine will be to give society the opportunity to institute public health measures aimed at blocking the pathogenesis of human diseases. Here are just a few examples:

– Population screening for early stages of common diseases.

The successful reduction in deaths from cervical cancer demonstrates the effectiveness of screening for early stages of disease. Cervical cancer is a type of squamous cell carcinoma that develops at the junction between the ectocervix (the squamous lined epithelium) and the endocervix (the glandular lined epithelium) in the os of the uterine cervix of women. Before the introduction of cervical precancer treatment, cervical carcinoma was one of the leading causes of cancer deaths in women worldwide. Today, in many countries that have not deployed precancer treatment, cervical cancer remains the leading cause of cancer deaths in women [48–50]. In the United States, a 70% drop in cervical cancer deaths followed the adoption of routine Pap smear screening [51–53]. No effort aimed at treating invasive cancers has provided an equivalent reduction in the number of cancer deaths. [Glossary Age-adjusted incidence, Pap smear] Today, we know that cervical carcinogenesis begins with a localized infection by one of several strains of human papillomavirus, transmitted during sexual intercourse by an infected male partner. In the late 1940s (and really up until the early 1980s), the viral etiology of cervical cancer was unknown. We did know that squamous cells sampled from the uterine os had highly characteristic morphologic appearances that preceded the development of invasive cancer. Thanks largely to the persistence of Dr. Papanicolaou and his coworkers, a standard screening test, known as the Pap smear, was developed to detect cervical precancers. If precancerous changes were found in a smear, a gynecologist could remove a superficial portion of the affected epithelium, and this would, in the vast majority of cases, stop the cancer from ever developing.

Morphologic and epidemiologic observations on Pap smears provided clues that eventually led to the identification of several strains of human papillomavirus as the major causes of cervical cancer. Today, a vaccine protective against carcinogenic strains of human papillomavirus is available [54].

As discussed in Precision Medicine and the Reinvention of Human Disease, Section 7.5, “What Is Precision Diagnosis?” new biomarkers are being developed for the early stages of disease, often preceding the development of any clinical symptoms. In general, diseases are easiest to treat in early stages, before they have had the chance to do any harm to organs. For example, precancers can often be effectively treated by excision, or, in some cases, by withdrawal of the agents that would otherwise lead to the progression of the precancer to the cancerous stage (e.g., cessation of hormonal replacement therapy to block breast cancer, cessation of smoking to block lung cancer, treatment of Helicobacter pylori infection to block MALToma).

We can hope that in the future advances in the field of Precision Medicine will identify the intermediate stages of development for common diseases. With this information, public health measures aimed at detecting and blocking diseases, in an early stage of development, will be deployed.

– The aggressive prevention and treatment for the most common patterns of diseases that lead to death

As discussed in Precision Medicine and the Reinvention of Human Disease, Section 2.3, “Cause of Death,” a well-composed death certificate contains a thoughtful sequence of medical conditions that develop over time, and that ultimately lead to the death of the patient. This data, if properly recorded and aggregated into a mortality database, should provide the most frequently occurring chains of events that account for human deaths. A public health effort aimed at breaking the early steps of these processes has the potential of extending the life expectancy of the population.

– Aggressive screening for carriers of infectious diseases

As discussed in Section 6.2, “Our Genome Is a Book Titled ‘The History of Human Infections,’” organisms that were formerly thought to be purely pathogenic are now known to frequently live quietly within infected humans, without causing symptoms of disease, and this would include the organisms that cause Chagas disease, leishmaniases, toxoplasmosis, tuberculosis, viruses such as Herpes viruses and hepatitis viruses B and C, and bacterial organisms, some of which circulate in the blood without causing disease under normal circumstances.

Sensitive diagnostic techniques, including genome sequencing of DNA in blood, may provide us with the opportunity to perform population screening for organisms that are opportunistic pathogens, or that produce long-term damage to carriers, or that are transmissible from carriers.

– Finding targets for vaccines that confer effectiveness against more than one target organism.

Thanks in no small part to Precision Medicine, we are learning that organisms play a role in many diseases that were once thought to have no infectious component. In particular, it is now widely accepted that infections contribute to at least one-fifth of all cancers occurring in humans. Examples of cancer causing organisms are:

– Epstein-Barr virus (B-cell lymphomas, Burkitt lymphoma, nasopharyngeal cancer, Hodgkin disease and T-cell lymphomas)
– Hepatitis B virus (hepatocellular carcinoma)
– Human papillomavirus types 5, 8, 14, 17, 20, and 47 (skin cancer)
– Human papillomavirus types 16, 18, 31, 33, 35, 39, 45, 52, 56, 58 (cervical cancer, anogenital cancer)
– Human papillomavirus types 6 and 11 (verrucous 
– Human papillomavirus types 16, 18, 33, 57, 73 (cancers of oral cavity, tongue, larynx, nasal cavity, and esophagus)
– Merkel cell polyomavirus (MCPyV) (Merkel cell carcinoma)
– HTLV-1 (adult T-cell leukemia)
– Human herpesvirus 8 (Kaposi sarcoma)
– Hepatitis C virus (hepatocellular carcinoma and low-grade lymphomas)
– JC, BK, and SV40-like polyoma viruses (tumors of brain and pancreatic islet tumors, and mesotheliomas)
– Human endogenous retrovirus HERV-K (seminomas and germ cell tumors)
– Schistosomiasis and squamous cell carcinoma of 
– Opisthorchis viverrini and Clonorchis sinensis, flatworms (flukes), found in Southeast Asia, 
– Helicobacter pylori and gastric MALToma (Mucosa-Associated Lymphoid Tissue lymphoma) [55]

Carcinogenic viruses profoundly influence the number of cancer deaths, worldwide. These include hepatitis B virus (associated with an increased incidence of hepatocellular carcinoma) and human papillomavirus (which causes cervical cancer). Liver cancer is the third leading cause of cancer deaths worldwide, accounting for 611,000 deaths in 2000 [50]. It is easy to understand that the importance of vaccine development for infections that contribute to chronic diseases and cancers cannot be overstated. As we learn more about the biological steps involved in the infection process, hope looms that vaccines and preventive drugs will be developed that target different types of organisms, based on shared properties of infection, invasion, immunologic resistance, persistence, or phylogeny, as discussed in Precision Medicine and the Reinvention of Human Disease, Section 4.4, “Pathway-Directed Treatments for Convergent Diseases,” [56–60].

- Jules Berman

key words: public health, prevention, precision medicine, cancer, cancer vaccines, jules j berman, Ph.D., M.D.

Monday, February 5, 2018

Treat the Pathway, not the Gene (from Precision Medicine and the Reinvention of Human Disease)

Treat the key pathway, not the genetic mutation (from Precision Medicine and the Reinvention of Human Disease)

Some of the earliest and most successful Precision Medicine drugs have targeted specific mutations occurring in specific subsets of diseases. One such example is ivacaftor, which targets the G551D mutation present in about 4% of individuals with cystic fibrosis [135]. It is seldom wise to argue with success, but it must be mentioned that the cost of developing a new drug is about $5 billion [136]. To provide some perspective, $5 billion exceeds the total gross national product of many countries, including Sierra Leone, Swaziland, Suriname, Guyana, Liberia, and the Central African Republic. Many factors contribute to the development costs, but the most significant is the incredibly high failure rate of candidate drugs. About 95% of the experimental medicines that are studied in humans fail to be both effective and safe. The costs of drug development are reflected in the rising costs of drugs.

When a new drug is marketed to a very small population of affected individuals, the cost of treating an individual may be astronomical. Americans should not pin their hopes on the belief that one day, the FDA or CMS (which administrates Medicare) will step in and put a stop to the price rises. The Food and Drug Administration can approve or reject drugs, but it does not regulate prices. Likewise, Medicare is not permitted to consider cost when it decides whether a treatment can be covered. Knowing this, some notable pharmaceutical companies have raised the prices of medications far beyond their manufacturing costs [137–139]. In effect, the cost of curing curable diseases may exceed our ability to pay for those cures [139].

It is strongly in the interests of society to develop drugs that have the widest possible user market [140]. Drugs that target a mutation that is specific for a few individuals with a rare disease, or a tiny subpopulation of individuals who have a common disease, are highly problematic.

Our experiences with disease convergence teach us that clinical phenotypes are influenced by the activities of pathways and are seldom restricted to a specific mutation in a specific gene. We know this because rare diseases that exhibit locus heterogeneity affect different genes, but often target the same pathway. Likewise, acquired phenotypes of genetic diseases often involve inhibitors of the same key pathways that drive their genetic counterparts, without involving the protein product of the genetic form of the disease. We also know that the acquired versions of most genetic diseases account for the bulk of disease occurrences. Therefore, if we want to develop treatments that benefit the greatest number of individuals affected by a disease, it would be far more practical to find treatments that target the disease-driving pathways than to design drugs that target a specific gene mutation involved in a small subset of affected patients.

Before closing, here are a few points worth considering (to be discussed in later blogs):

  • As a generalization, any drug that can block a pathway, without producing serious side effects, may serve as a candidate treatment for all of the diseases that are driven by the pathway.

  • Individuals in the early stages of common diseases, before multiple disease pathways converge to produce an intractable clinical phenotype, may be particularly amenable to treatments that interfere with the pathways that promote the ensuing steps in pathogenesis.

The topic of clinical trials designed to test drugs targeting convergent disease pathways is discussed in Precision Medicine and the Reinvention of Human Disease, Section 9.6, “Fast, Cheap, Precise Clinical Trials.”

- Jules Berman

key words: precision medicine, precision treatment, clinical trials, cost of precision medicine, pathways, convergent pathways, jules j berman Ph.D., M.D.

Sunday, February 4, 2018

National Patient Identifiers (from Precision Medicine and the Reinvention of Human Disease)

Readers from outside the United States are probably wondering why the United States agonizes over the problem of patient identification. In many other countries, individuals are given a unique national identifier, and all medical data associated with the individual is kept in a central data repository under the aegis of the government’s health service. A single, permanent identifier is used by a patient throughout life, in every encounter with a hospital, clinic, or private physician. As a resource for researchers, the national patient identifier ensures the completeness of data sets and eliminates many of the problems associated with poorly implemented local identifier systems.

In the United States, there has been fierce resistance to the idea of national patient identifiers. The call for a national patient identification system is raised from time to time. The benefits to patients and to society are many. Regardless, US citizens are reluctant to have an identifying number that is associated with a federally controlled electronic record of their private medical information. In part, this distrust results from the lack of any national insurance system in the United States. Most health insurance in the United States is private, and private insurers have wide discretion over the fees and services provided to enrollees. There is a fear that if there were a national patient identifier with centralized electronic medical records, insurers would withhold reimbursements or raise premiums or otherwise endanger the health of patients. Because the cost of US medical care is the highest in the world, medical bills for uninsured patients can quickly mount, impoverishing individuals and families.

Realistically, though, no data is safe. Medical records can be stolen, and governments can demand access to medical records, when necessary [See Lewin T. Texas orders health clinics to turn over patient data. The New York Times; October 23, 2015].

Life has its compromises. Everyone wants their privacy and we all get angry when we hear that our confidential information has been stolen. Data breaches today may involve hundreds of millions of confidential records. The majority of Americans have had social security numbers, credit card information, and private identifiers (e.g., birth dates, city of birth, names of relatives) misappropriated or stolen. It’s natural to object to anything that might jeopardize our privacy. Nonetheless, we must ask ourselves the following: “Is it rational to forfeit the very real opportunity of developing new safe and effective treatments for serious diseases, for the very small likelihood that someone will crack your deidentified research record and somehow leverage this information to your disadvantage?”

Suppose everyone in the United States were given a choice: you can be included in a national patient identifier system, or you can opt out. Most likely, there would be many millions of citizens who would opt out of the offer, seeing no particular advantage in having a national patient identifier, and sensing some potential harm. Now, suppose you were told that if you chose to opt out, you would not be permitted to use any of the therapeutic or preventive benefits that come from studies performed with data collected from the national patient identifier system. These lost benefits would include safe and effective drugs, warnings of emerging epidemics, information on side effects associated with your medications, biomarker tests for preventable illnesses, and so on. Those who made no effort to help the system would be barred from any of the benefits that the system provided. Would you reconsider your refusal to cooperate, if you knew the consequences? Of course, this is a fanciful scenario, but it makes a point.

- Jules Berman

key words: identification, confidentiality, privacy, medical identifier, NPI, national patient identifier, jules j berman, Ph.D., M.D.

Saturday, February 3, 2018

Paradoxes of Classification (and terrible Class definitions)

The formal systems that assign data objects to classes, and that relate classes to other classes, are known as ontologies. When the data within a Big Data resource is classified within an ontology, data analysts can determine whether observations on a single object will apply to other objects in the same class. Similarly, data analysts can begin to ask whether observations that hold true for a class of objects will relate to other classes of objects. Basically, ontologies help scientists fulfill one of their most important tasks: determining how things relate to other things.

A classification is a very simple form of ontology, in which each class is allowed to have only one parent class. To build a classification, the ontologist must do the following: 1) define classes (i.e., find the properties that define a class and extend to the subclasses of the class); 2) assign instances to classes; 3) position classes within the hierarchy; and 4) test and validate all the above.

The constructed classification becomes a hierarchy of data objects conforming to a set of principles:

  • The classes (groups with members) of the hierarchy have a set of properties or rules that extend to every member of the class and to all of the subclasses of the class, to the exclusion of unrelated classes. A subclass is itself a type of class wherein the members have the defining class properties of the parent class plus some additional property(ies) specific for the subclass.

  • In a hierarchical classification, each subclass may have no more than one parent class. The root (top) class has no parent class. The biological classification of living organisms is a hierarchical classification.
  • At the bottom of the hierarchy is the class instance. For example, your copy of this book is an instance of the class of objects known as "books".
  • Every instance belongs to exactly one class.
  • Instances and classes do not change their positions in the classification. As examples, a horse never transforms into a sheep, and a book never transforms into a harpsichord.
  • The members of classes may be highly similar to one another, but their similarities result from their membership in the same class (i.e., conforming to class properties), and not the other way around (i.e., similarity alone cannot define class inclusion).

Classifications are always simple; the parental classes of any instance of the classification can be traced as a simple, non-branched list, ascending through the class hierarchy. As an example, here is the lineage for the domestic horse (Equus caballus), from the classification of living organisms:

Equus caballus
Equus subg. Equus
Fungi/Metazoa group
cellular organisms

Taxonomists who view this lineage instantly grasp the place of domestic horses in the classification of all living organisms.
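The single-parent rule is what makes such a lineage a simple, non-branched list. Here is a minimal Python sketch of the idea, not a definitive implementation. The `PARENT` map and the `lineage` function are hypothetical names invented for this illustration, and the map is a heavily abbreviated toy hierarchy (many intermediate ranks are omitted, and Mammalia is treated as the root only for the sake of the example).

```python
# Hypothetical, abbreviated parent map: every class names exactly one parent.
# Many intermediate taxonomic ranks are deliberately left out.
PARENT = {
    "Equus caballus": "Equus",
    "Equus": "Equidae",
    "Equidae": "Perissodactyla",
    "Perissodactyla": "Mammalia",
    "Mammalia": None,  # treated as the root of this toy hierarchy
}


def lineage(name, parent=PARENT):
    """Return the non-branched list of classes from `name` up to the root."""
    path = []
    seen = set()
    while name is not None:
        if name in seen:
            # A class that is its own ancestor would break the hierarchy.
            raise ValueError(f"cycle detected at {name!r}")
        seen.add(name)
        path.append(name)
        name = parent[name]  # exactly one parent, so there is nothing to branch on
    return path


print(lineage("Equus caballus"))
```

Because each key in the map has one and only one parent, the ascent can never fork; an ontology that allowed multiple parents would need a graph traversal instead of this simple loop.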

The rules for constructing classifications seem obvious and simplistic. Surprisingly, the task of building a logical and self-consistent classification is extremely difficult. Most classifications are rife with logical inconsistencies and paradoxes. Let's look at a few examples.

In 1975, while touring the Bethesda, Maryland campus of the National Institutes of Health, I was informed that their Building 10 was the largest all-brick building in the world, providing a home to over 7 million bricks. Soon thereafter, an ambitious construction project was undertaken to greatly expand the size of Building 10. When the work was finished, Building 10 was no longer the largest all-brick building in the world. What happened? The builders used material other than brick, and Building 10 lost its classification as an all-brick building, violating the immutability rule of class assignments.

Apparent paradoxes that plague any formal conceptualization of classifications are not difficult to find. Let's look at a few more examples.

Consider the geometric class of ellipses: planar objects in which the sum of the distances to two focal points is constant. Class Circle is a child of Class Ellipse, for which the two focal points of instance members occupy the same position, in the center, producing a radius of constant size. Imagine that Class Ellipse is provided with a class method called "stretch", in which the foci are moved further apart, thus producing flatter objects. When the parent class "stretch" method is applied to members of the Class Circle, the circle stops being a circle and becomes an ordinary ellipse. Hence the inherited "stretch" method forces members of Class Circle to transition out of their assigned class, violating the intransitive rule of classifications.
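The circle paradox can be reproduced in a few lines of Python. This is only a sketch under stated assumptions: the attribute names `distance_sum` and `focal_separation`, and the helper `is_circle`, are hypothetical choices made for this illustration, and the ellipse is modeled by nothing more than the constant sum of distances and the gap between its foci.

```python
class Ellipse:
    """Planar curve for which the sum of distances to two foci is constant."""

    def __init__(self, distance_sum, focal_separation):
        if not 0 <= focal_separation < distance_sum:
            raise ValueError("need 0 <= focal_separation < distance_sum")
        self.distance_sum = distance_sum          # constant sum of distances
        self.focal_separation = focal_separation  # gap between the two foci

    def stretch(self, amount):
        """Move the foci further apart, producing a flatter curve."""
        new_separation = self.focal_separation + amount
        if not 0 <= new_separation < self.distance_sum:
            raise ValueError("foci cannot be moved that far")
        self.focal_separation = new_separation

    def is_circle(self):
        # Geometrically, a circle is an ellipse whose foci coincide.
        return self.focal_separation == 0


class Circle(Ellipse):
    """Child of Ellipse: both foci sit at the center."""

    def __init__(self, radius):
        super().__init__(distance_sum=2 * radius, focal_separation=0)


c = Circle(radius=1.0)
c.stretch(0.5)  # method inherited from the parent class
print(isinstance(c, Circle), c.is_circle())  # True False
```

After the call to `stretch`, the object is still labeled a Circle by the class system, yet it no longer satisfies the defining property of a circle. The label and the geometry have parted ways, which is the paradox in miniature.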

Let's look at the "Bag" class of objects. A "Bag" is a collection of objects, and the Class Bag is included in most object-oriented programming languages. A "Set" is also a collection of objects (i.e., a subclass of Bag), with the special feature that duplicate instances are not permitted. For example, if Kansas is a member of the set of U.S. States, then you cannot add a second state named "Kansas" to the set. If Class Bag were to have an "increment" method, that added "1" to the total count of objects in the bag, whenever an object is added to Class Bag, then the "increment" method would be inherited by all of the subclasses of Class Bag, including Class Set. But Class Set cannot increase in size when duplicate items are added. Hence, inheritance creates a paradox in the Class Set.
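Here is one way the Bag and Set paradox might surface in code, offered as a hedged sketch rather than a description of any particular language's library. The method names `store` and `add`, and the attributes `items` and `count`, are assumptions introduced for this example; only the "increment" method comes from the discussion above.

```python
class Bag:
    """Collection that allows duplicates and keeps a running count."""

    def __init__(self):
        self.items = []
        self.count = 0

    def increment(self):
        # Adds 1 to the total count of objects in the bag.
        self.count += 1

    def store(self, item):
        self.items.append(item)

    def add(self, item):
        self.store(item)
        self.increment()  # runs every time an object is added


class Set(Bag):
    """Subclass of Bag in which duplicate instances are not permitted."""

    def store(self, item):
        if item not in self.items:
            self.items.append(item)
    # `add` and `increment` are inherited unchanged from Bag.


states = Set()
states.add("Kansas")
states.add("Kansas")  # the duplicate is rejected by store()...
print(states.count, len(states.items))  # 2 1  ...but the count still rises
```

The inherited count claims two members while the set holds one. Nothing crashed, and no single line looks wrong; the inconsistency comes entirely from the class relationship.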

How does a data scientist deal with class objects that disappear from their assigned class and reappear elsewhere? In the examples discussed here, we saw the following:

  1. Building 10 at NIH was defined as the largest all-brick building in the world. Strictly speaking, Building 10 was a structure, and it had a certain weight and dimensions, and it was constructed of brick. "Brick" is an attribute or property of buildings, and properties cannot form the basis of a class of building, if they are not a constant feature shared by all members of the class (i.e., some buildings have bricks; others do not). Had we not conceptualized an "all-brick" class of building, we would have avoided any confusion.

  2. Class Circle qualified as a member of Class Ellipse, because a circle can be imagined as an ellipse whose two focal points happen to occupy the same location. Had we defined Class Ellipse to specify that class members must have two separate focal points, we could have excluded circles from class Ellipse. Hence, we could have safely included the stretch method in Class Ellipse without creating a paradox.

  3. Class Set was made a subclass of Class Bag, but the increment method of Class Bag could not apply to Class Set. We created Class Set without taking into account the basic properties of Class Bag, which must apply to all its subclasses. Perhaps it would have been better if Class Set and Class Bag were created as children of Class Collection, each with its own set of properties.
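The third remedy, making Set and Bag siblings under a common parent, can be sketched as follows. This is an illustration under assumptions, not a prescribed design: the parent class holds only the properties that both children genuinely share (here, size and membership testing), the `add` method is a hypothetical name, and the size is computed from the stored items rather than tracked by an inherited "increment" method.

```python
class Collection:
    """Parent class: only the properties shared by every kind of collection."""

    def __init__(self):
        self._items = []

    def __len__(self):
        # Size is derived from the contents, so it cannot drift out of sync.
        return len(self._items)

    def __contains__(self, item):
        return item in self._items


class Bag(Collection):
    """Child of Collection that permits duplicates."""

    def add(self, item):
        self._items.append(item)


class Set(Collection):
    """Child of Collection that rejects duplicates."""

    def add(self, item):
        if item not in self._items:
            self._items.append(item)


bag, states = Bag(), Set()
for name in ("Kansas", "Kansas"):
    bag.add(name)
    states.add(name)

print(len(bag), len(states))   # 2 1
print(issubclass(Set, Bag))    # False: Set no longer inherits from Bag
```

Each child defines its own rule for adding members, so neither inherits a behavior that contradicts its definition.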

Worst Class Definition Ever

The worst definition of a Class may have been that given to the Kingdom of Protozoa, defined as the class of one-celled eukaryotic organisms. The problem here is that all of the classes of multicelled organisms (e.g., animals, plants and fungi) descended from classes of one-celled organisms. This means that Class Protozoa (defined as one-cell organisms) must exclude from its lineage all descendant classes that are multicellular. Hence, Kingdom Protozoa was given a definition that, paradoxically, excluded its own descendants. What were they thinking, back in the mid-19th century when Class Protozoa was conceived?

- Jules Berman

key words: classification, ontology, taxonomy, paradoxes, precision medicine, jules j berman Ph.D., M.D.