Saturday, February 6, 2016

Rules for Rare Diseases

Rare Disease Day is coming up February 29 (a rare day for rare diseases). In honor of the upcoming event, I'll be posting blogs all month, related to the rare diseases and to rare disease funding.

For today, please consider these three biological "Rules" that I use when I'm trying to convince my colleagues of the importance of rare disease research.

Rule - Rare diseases are not the exceptions to the general rules of disease biology; they are the exceptions upon which the general rules are based.
Brief Rationale - All biological systems must follow the same rules. If a rare disease is the basis for a general assertion about the biology of disease, then the rule must apply to the common diseases.

Every rare disease tells us something about the normal functions of organisms. When we study a rare hemoglobinopathy, we learn something about the consequences that follow when the normal hemoglobin is replaced with an abnormal hemoglobin. This information leads us to a deeper understanding of the normal role of hemoglobin. Likewise, rare urea cycle disorders, coagulation disorders, metabolic disorders, and endocrine disorders have taught us how these functional pathways operate under normal conditions (1).

Rule - Every common disease is a collection of different diseases that happen to have the same clinical phenotype.
Brief Rationale - Numerous causes and pathways may lead to the same biological outcome.

Consider the heart attack; its risk of occurrence is elevated by dozens, maybe hundreds, of factors. Obesity, poor diet, smoking, stress, lack of exercise, hypertension, diabetes, disorders of blood lipid metabolism, infections, and male gender all contribute to heart attacks. Regardless of the contributing factors, a common event precedes and causes the heart attack: the blockage of a coronary artery. Blockage is often caused by an atherosclerotic plaque. Consequently, rare inherited conditions that produce atherosclerotic plaques can produce the common heart attack (e.g., inherited disorders of lipid metabolism). We infer that for every common disease, there are rare, inherited diseases that account for a small subset of cases.

Rule - Rare diseases inform us how to treat common diseases.
Brief Rationale - When we encounter a common disease, we look to see what pathways are dysfunctional, and we develop a rational approach to prevention, diagnosis, and treatment based on experiences drawn from the rare diseases that are driven by the same dysfunctional pathways.

Many heart attacks are caused by atherosclerotic plaque blocking a coronary artery. Many conditions produce atherosclerotic plaque, but a rare condition known as familial hypercholesterolemia is associated with some cases of coronary atherosclerosis that occur in young individuals. Studies on familial hypercholesterolemia led to the finding that statins inhibit the rate-limiting enzyme in cholesterol synthesis (hydroxymethylglutaryl coenzyme A reductase), thus reducing the blood levels of cholesterol and blocking the formation of plaque. The treatment of a pathway operative in a rare form of hypercholesterolemia has become the most effective treatment for commonly occurring forms of hypercholesterolemia, and a mainstay in the prevention of the common heart attack (2).

[1] Wizemann T, Robinson S, Giffin R. Breakthrough Business Models: Drug Development for Rare and Neglected Diseases and Individualized Therapies Workshop Summary. National Academy of Sciences, 2009.

[2] Stossel TP. The discovery of statins. Cell 134:903-905, 2008.

- Jules Berman (copyrighted material)

key words: rare diseases, biological rules, disease funding, common diseases, complex diseases, precision medicine, jules j berman

Friday, February 5, 2016

Genes that Cause More Than One Disease

Rare Disease Day is coming up February 29 (a rare day for rare diseases). In honor of the upcoming event, I'll be posting blogs related to the rare diseases.

There are numerous examples wherein mutations in one gene may result in more than one disease, usually depending on the mutation involved. In some cases, the diseases caused by the altered gene are fundamentally similar (e.g., spherocytosis and elliptocytosis, caused by mutations in the alpha-spectrin gene; Usher syndrome type IIIA and retinitis pigmentosa-61, caused by mutations in the CLRN1 gene). In other cases, diseases caused by the same gene may have no obvious relation to one another (e.g., Stickler syndrome type III (STL3), fibrochondrogenesis-2, and a form of nonsyndromic hearing loss, all caused by mutations in the COL11A2 gene).

In the following list, each disease-causing gene is followed by the different diseases caused by gene alterations.

ABCB6 gene
The Lan(-) blood group phenotype
Microphthalmia, isolated, with coloboma 7

ACTA2 gene
Moyamoya disease-5
Form of thoracic aortic aneurysm

Fish-eye disease
Norum disease

Hereditary spherocytosis-3

Parkinson disease-1
Autosomal dominant Parkinson disease-4

ALX4 gene
Frontonasal dysplasia-2
Parietal foramina-2

ANO5 gene
Gnathodiaphyseal dysplasia (GDD), or osteogenesis imperfecta with unusual skeletal lesions
Limb-girdle muscular dystrophy-2L
Miyoshi muscular dystrophy-3

ARX gene
Proud syndrome
Form of nonspecific X-linked mental retardation

ATN1 gene
Dentatorubral-pallidoluysian atrophy
Haw River syndrome

ATR gene
Seckel syndrome-1
Form of ataxia telangiectasia

BAG3 gene
Autosomal dominant myofibrillar myopathy
Dilated cardiomyopathy-1HH

BAP1 gene
Susceptibility to uveal melanoma
Predisposition to malignant mesothelioma upon asbestos exposure

BCS1L gene
Bjornstad syndrome
GRACILE syndrome

BUB1B gene
Mosaic variegated aneuploidy syndrome-1 (See Glossary item, Aneuploidy)
Form of premature chromatid separation

C20ORF54 gene
Brown-Vialetto-Van Laere syndrome, a ponto-bulbar palsy with deafness
Fazio-Londe disease

CACNA1A gene
Familial hemiplegic migraine
Spinocerebellar ataxia 6

CACNA1F gene
X-linked cone-rod dystrophy-3
Aland Island eye disease

CARD15 gene
Early-onset sarcoidosis
Blau syndrome

CASK gene
FG syndrome-4 ("FG" are the initials of the first proband)
Mental retardation, X-linked, with or without nystagmus
Mental retardation and microcephaly with pontine and cerebellar hypoplasia

Limb-girdle muscular dystrophy type 1C
Tateyama type of distal myopathy

CEP152 gene
Autosomal recessive primary microcephaly-4
Seckel syndrome-5

CEP290 gene
Bardet-Biedl syndrome 14
Joubert syndrome 5
Leber congenital amaurosis 10
Meckel syndrome 4
Senior-Loken syndrome 6

CHAT (Choline acetyltransferase) gene
Presynaptic congenital myasthenia syndrome with episodic ataxia
Familial infantile myasthenia gravis

CHX10 gene
Microphthalmia, isolated-2
Microphthalmia with coloboma-3
Isolated colobomatous microphthalmia-3

CLCN5 gene
X-linked recessive hypophosphatemic rickets
X-linked recessive nephrolithiasis with renal failure
Dent disease-1

CLN8 gene
Neuronal ceroid lipofuscinosis-8
Progressive epilepsy with mental retardation

CLRN1 gene
Usher syndrome type IIIA
Retinitis pigmentosa-61

COL11A2 gene
Stickler syndrome type III
Form of nonsyndromic hearing loss

COL2A1 gene
Stickler syndrome type I, sometimes called membranous vitreous type
Osteoarthritis with mild chondrodysplasia
Achondrogenesis type II
Czech dysplasia

COL7A1 gene
Classic dystrophic epidermolysis bullosa pruriginosa
Nonsyndromic congenital nail disorder-8

COL9A1 gene
Autosomal recessive form of Stickler syndrome
Multiple epiphyseal dysplasia-6

COL9A2 gene
Multiple epiphyseal dysplasia-2
Stickler syndrome type V

Autosomal dominant epidermolysis bullosa dystrophica
Pretibial dystrophic epidermolysis bullosa
Stickler syndrome
Strudwick type of spondyloepimetaphyseal dysplasia
Spondyloperipheral dysplasia
Ehlers-Danlos syndrome type IV

Keratitis-ichthyosis-deafness syndrome
Deafness, autosomal dominant-3A

CRYAB gene
Posterior polar cataract-2
Fatal infantile hypertonic myofibrillar myopathy

CYLD gene
Familial cylindromatosis
Multiple familial trichoepithelioma-1
Brooke-Spiegler syndrome

DOCK8 gene
Hyper-IgE recurrent infection syndrome, also known as Job syndrome
Autosomal dominant mental retardation-2

DYM gene
Dyggve-Melchior-Clausen disease
Smith-McCort dysplasia

DYNC1H1 gene
Autosomal dominant axonal Charcot-Marie-Tooth disease type 2O
Autosomal dominant mental retardation-13

ENPP1 gene
Generalized arterial calcification of infancy-1
Autosomal recessive hypophosphatemic rickets-2

ESCO2 gene
SC phocomelia syndrome, also known as SC pseudothalidomide syndrome
Roberts syndrome

FBLN5 gene
Autosomal recessive cutis laxa type IA
Macular degeneration, age-related-3

FBN1 gene
Acromicric dysplasia
Stiff skin syndrome
Autosomal dominant form of isolated ectopia lentis
Weill-Marchesani syndrome-1
Weill-Marchesani syndrome-2
Geleophysic dysplasia-2

FGFR1 gene
8p11 myeloproliferative disorder

FGFR2 gene
Beare-Stevenson cutis gyrata syndrome
Form of craniosynostosis
Classic Crouzon syndrome

FGFR3 gene
Muenke craniosynostosis syndrome
CATSHL syndrome
Crouzon syndrome with acanthosis nigricans

FIG4 gene
Charcot-Marie-Tooth type 4J
Form of autosomal dominant ALS
Amyotrophic lateral sclerosis 11

FLNA gene
Terminal osseous dysplasia
FG syndrome-2
X-linked cardiac valvular dysplasia

FLNC gene
Filamin C-related myofibrillar myopathy
Distal myopathy-4 (MPD4), also known as Williams distal myopathy

FMR1 gene
Fragile X tremor/ataxia syndrome
Fragile X mental retardation syndrome

FOXL2 gene
Blepharophimosis, ptosis, and epicanthus inversus syndrome, with premature ovarian failure (BPES type I)
Blepharophimosis, ptosis, and epicanthus inversus syndrome, without premature ovarian failure (BPES type II)

FREM1 gene
Bifid nose with or without anorectal and renal anomalies

GATA2 gene
Primary lymphedema with myelodysplasia
Dendritic cell, monocyte, B lymphocyte, and natural killer lymphocyte deficiency

GDAP1 gene
Autosomal recessive axonal CMT with vocal cord paresis
Autosomal recessive demyelinating CMT4A
Autosomal recessive axonal Charcot-Marie-Tooth disease type 2K

GDF3 gene
Klippel-Feil syndrome-3
Isolated microphthalmia with coloboma-6
Isolated microphthalmia-7

GDF6 gene
Klippel-Feil syndrome-1
Isolated microphthalmia-4

GJA1 gene
Syndactyly type III
Oculodentodigital dysplasia
Atrioventricular septal defect 3

GJB2 gene
Autosomal recessive deafness-1A
Hystrix-like ichthyosis-deafness syndrome

GJC2 gene (encodes gap junction protein, gamma 2)
Autosomal recessive spastic paraplegia-44
Hereditary lymphedema type IC
Form of Pelizaeus-Merzbacher disease

Familial hyperinsulinemic hypoglycemia-3
Maturity onset diabetes of the young-2

GNAS gene
Progressive osseous heteroplasia
Pseudohypoparathyroidism type Ia

GPR143 gene
Ocular albinism type I
X-linked congenital nystagmus-6
Nystagmus 6, congenital, X-linked

HCN4 gene
Brugada syndrome-8
Autosomal dominant form of sick sinus syndrome

Isolated microphthalmia with coloboma-5

HPRT gene
Lesch-Nyhan syndrome
Kelley-Seegmiller syndrome

HRG gene
Histidine-rich glycoprotein deficiency

HSPB8 gene
Axonal Charcot-Marie-Tooth disease type 2L

IGHMBP2 gene
Distal hereditary motor neuronopathy type VI (dHMN6 or HMN6)
Spinal muscular atrophy, with respiratory distress-1

INF2 gene
Focal segmental glomerulosclerosis-5
Charcot-Marie-Tooth disease E with focal segmental glomerulonephritis

JAK2 gene
Polycythemia vera, the most common form of primary polycythemia

KCNE2 gene
Form of atrial fibrillation
Long QT syndrome-6

KCNH2 gene
Long QT syndrome-2
Short QT syndrome-1

KCNJ11 gene
Hyperinsulinemic hypoglycemia-2 (HHF2)

KCNJ5 gene
Familial hyperaldosteronism type III
Long QT syndrome-13

KCNQ1 gene
Form of Jervell and Lange-Nielsen syndrome (JLNS1)
Form of autosomal dominant atrial fibrillation (ATFB3)
Short QT syndrome-2

KIF1A gene
Hereditary sensory neuropathy type IIC
Form of mental retardation

KLF1 gene
Congenital dyserythropoietic anemia type IV (See Glossary item, Dyserythropoiesis)
Form of hereditary persistence of fetal hemoglobin

KRT74 gene
Hypotrichosis simplex of the scalp-2
Autosomal dominant form of woolly hair

LDB3 gene
Left ventricular noncompaction-3
Form of dilated cardiomyopathy with or without left ventricular noncompaction

LMNA gene
Form of autosomal recessive axonal CMT
Slovenian type heart-hand syndrome

LRP4 gene
Cenani-Lenz syndactyly syndrome

LRP5 gene
Familial exudative vitreoretinopathy-4
Autosomal dominant osteopetrosis type I

Form of multiple epiphyseal dysplasia
Form of autosomal recessive spondyloepimetaphyseal dysplasia

MECP2 gene
Form of neonatal severe encephalopathy
Classic Rett syndrome

MED12 gene
Lujan-Fryns syndrome
Opitz-Kaveggia syndrome, also known as FG syndrome-1

MFRP gene
Posterior microphthalmia, retinitis pigmentosa, foveoschisis, and optic disc drusen

MLL2 gene
Kabuki syndrome-1
Otitis media in infancy

MSX1 gene
Form of selective tooth agenesis
Orofacial cleft 5
Witkop syndrome

MYH6 gene
Familial hypertrophic cardiomyopathy-14
Form of dilated cardiomyopathy

MYH7 gene
Form of scapuloperoneal myopathy
Hypertrophic cardiomyopathy-1
Cardiomyopathy, dilated, 1S

MYH9 gene
Fechtner syndrome
May-Hegglin anomaly
Sebastian syndrome

NEMO gene
Anhidrotic ectodermal dysplasia with immunodeficiency, osteopetrosis, and lymphedema
Atypical mycobacteriosis, familial
Familial incontinentia pigmenti
Invasive pneumococcal disease, recurrent isolated, type 2

NF1 gene
Watson syndrome
Neurofibromatosis-Noonan syndrome variant of neurofibromatosis-1

NHS gene
Nance-Horan syndrome
X-linked congenital cataract

NKX2-5 gene
Atrial septal defect of the secundum type, with or without atrioventricular conduction defects
Congenital nongoitrous hypothyroidism-5
Hypoplastic left heart syndrome-2

NOTCH2 gene
Hajdu-Cheney syndrome
Alagille syndrome-2

NPHP1 gene
Senior-Loken syndrome-1
Form of Joubert syndrome plus nephronophthisis

NPHP3 gene
Meckel syndrome, type 7

NPHP4 gene
Form of Senior-Loken syndrome that maps to 1p36
Type 4 nephronophthisis

NPHP6 gene
Form of Senior-Loken syndrome that maps to 12q21-32
Joubert syndrome-5

NR0B1 gene
X-linked congenital adrenal hypoplasia with hypogonadotropic hypogonadism
46,XY sex reversal-2

NR5A1 gene
Premature Ovarian Failure-7
Form of 46,XY sex reversal

NRAS gene
Form of Noonan syndrome (NS6)
Form of autoimmune lymphoproliferative syndrome, designated type IV (ALPS4)

NSD1 gene
Familial Sotos syndrome
Sotos syndrome
Weaver syndrome-1
Classic Sotos syndrome

OPTN gene
Amyotrophic lateral sclerosis-12
Form of adult-onset primary open angle glaucoma (POAG), designated GLC1E

Ectrodactyly, ectodermal dysplasia, and cleft lip/palate syndrome-3
Split-hand/split-foot malformation

PAX3 gene
Craniofacial-deafness-hand syndrome
Waardenburg syndrome type-3
Waardenburg syndrome type-1

PDE6B gene
Autosomal dominant congenital stationary night blindness-2
Form of retinitis pigmentosa

PDE8B gene
Autosomal dominant striatal degeneration
Primary pigmented nodular adrenocortical disease-3

PDX1 gene
Congenital pancreatic agenesis
Maturity onset diabetes of the young-4

PIGA gene
Paroxysmal nocturnal hemoglobinuria
Multiple congenital anomalies-hypotonia-seizures syndrome-2

PLA2G6 gene
Neurodegeneration with brain iron accumulation-2A
Neurodegeneration with brain iron accumulation-2B
Adult-onset dystonia-parkinsonism, also known as Parkinson disease-14

PLEC1 gene
Epidermolysis bullosa simplex with pyloric atresia
Epidermolysis bullosa simplex
Autosomal recessive limb-girdle muscular dystrophy type 2Q

POLG gene
Alpers syndrome
Neurogastrointestinal encephalopathy

Autosomal recessive progressive external ophthalmoplegia (PEOB)
Sensory ataxic neuropathy, dysarthria, and ophthalmoparesis

POMGNT1 gene
Walker-Warburg syndrome (WWS) or muscle-eye-brain disease
Muscular dystrophy-dystroglycanopathy-B3
Muscular dystrophy-dystroglycanopathy-C3

PRKAR1A gene
Acrodysostosis with hormone resistance
Carney complex, type 1

PROM1 gene
Macular dystrophy, retinal, type 2
Stargardt disease-4

Stargardt disease-4
Retinal macular dystrophy-2
Cone-rod dystrophy-12

PRPS1 gene
Arts syndrome
X-linked deafness-1

PRRT2 gene
Familial infantile convulsions with paroxysmal choreoathetosis
Benign familial infantile seizures-2
Paroxysmal kinesigenic dyskinesia

PSEN1 gene
Dilated cardiomyopathy-1U
Familial acne inversa-3
Form of early onset Alzheimer's disease

PTPN11 gene
Noonan syndrome-1

PYCR1 gene
Autosomal recessive cutis laxa type IIIB
Autosomal recessive cutis laxa type IIB

RAB27A gene
Melanosis with immunologic abnormalities with or without neurologic impairment
Griscelli syndrome type 2

RAF1 gene
Form of Noonan syndrome
LEOPARD syndrome-2

RDS gene
Retinitis pigmentosa-7
Adult-onset vitelliform macular dystrophy (AVMD)

RET gene
Susceptibility to Hirschsprung disease-1
Multiple endocrine neoplasia-2B
Familial medullary thyroid carcinoma MTC

ROR2 gene
Brachydactyly type B1
Autosomal recessive Robinow syndrome

RPE65 gene
Leber congenital amaurosis-2
Form of autosomal recessive retinitis pigmentosa

RPGR gene
Retinitis pigmentosa-3
X-linked cone-rod dystrophy
X-linked retinitis pigmentosa with recurrent respiratory infections

RPGRIP1 gene
Autosomal recessive cone-rod dystrophy-13
Leber congenital amaurosis-6

SAMHD1 gene
Aicardi-Goutieres syndrome-5
Chilblain lupus-2

SCN1A gene
Febrile seizures, familial, type 3A
Familial hemiplegic migraine-3

SCN1B gene
Generalized epilepsy with febrile seizures plus, type 1
Brugada syndrome-5

SCN2A gene
Benign familial neonatal-infantile seizures-3
Early infantile epileptic encephalopathy-11

SCN4A gene
Hypokalemic periodic paralysis type 2
Form of congenital myasthenic syndrome

SCN5A gene
Brugada syndrome-1
Long QT syndrome-3
Sick sinus syndrome (some cases)
Atrial fibrillation (some cases)
Dilated cardiomyopathy (some cases)

SEMA4A gene
Form of RP
Cone-rod dystrophy-10

SH3TC2 gene
Charcot-Marie-Tooth disease type 4C
Mild mononeuropathy of the median nerve

SHH gene
Microphthalmia with coloboma 5

SLC16A1 gene
Erythrocyte lactate transporter defect
Form of hyperinsulinemic hypoglycemia

SLC25A19 gene
Amish lethal microcephaly
Thiamine metabolism dysfunction syndrome-3
Bilateral striatal degeneration and progressive polyneuropathy

SLC26A4 gene
Enlarged vestibular aqueduct
Pendred syndrome

SLC2A1 gene
Dystonia 18 (DYT18)
Autosomal recessive primary hypertrophic osteoarthropathy-2

SLC33A1 gene
Spastic paraplegia-42
Congenital cataracts, hearing loss, and neurodegeneration

SLC34A1 gene
Autosomal recessive form of Fanconi renotubular syndrome
Hypophosphatemic nephrolithiasis/osteoporosis-1
Fanconi renotubular syndrome-2

SLC4A1 gene
Band 3 Coimbra
Waldner blood group expression
Autosomal recessive distal renal tubular acidosis with hemolytic anemia

SLC4A11 gene
Corneal endothelial dystrophy-2
Fuchs endothelial corneal dystrophy-4

SMAD4 gene
Myhre syndrome
Juvenile polyposis syndrome

SOS1 gene
Gingival fibromatosis-1
Form of Noonan syndrome

SOST gene
Craniodiaphyseal dysplasia, autosomal dominant
Van Buchem disease

STAT1 gene
Mycobacterial and viral infections, susceptibility to, autosomal recessive
Familial chronic mucocutaneous candidiasis-7

SYCP3 gene
Spermatogenic failure 4
Recurrent pregnancy loss 4

TGFBR2 gene
Loeys-Dietz syndrome type 2B
Hereditary nonpolyposis colorectal cancer-6

Autosomal dominant dilated cardiomyopathy-1G
Limb-girdle muscular dystrophy type 2J
Tardive tibial muscular dystrophy

TMEM216 gene
Meckel syndrome type 2
Joubert syndrome-2

TNFRSF13B gene
Immunoglobulin A (IgA) deficiency-2
Common variable immunodeficiency-2

TREX1 gene
Aicardi-Goutieres syndrome-1 (Aicardi-Goutieres syndrome can also be caused by mutations in the SAMHD1 or Ribonuclease H2 genes)
Chilblain lupus-1

TRPV4 gene
Brachyolmia type 3
Metatropic dysplasia
Parastremmatic dwarfism
Form of scapuloperoneal spinal muscular atrophy
Maroteaux type of spondyloepiphyseal dysplasia
Kozlowski type of spondylometaphyseal dysplasia
Congenital distal spinal muscular atrophy
Hereditary motor and sensory neuropathy type IIC

TTR gene
Form of hereditary amyloidosis
Euthyroidal hyperthyroxinemia

TULP1 gene
Retinitis pigmentosa-14
Leber congenital amaurosis-15

VHL gene
Von Hippel-Lindau syndrome
Familial erythrocytosis-2

VSX1 gene
Posterior polymorphous corneal dystrophy-1
Craniofacial anomalies and anterior segment dysgenesis syndrome

WAS gene
Wiskott-Aldrich syndrome
X-linked thrombocytopenia
X-linked neutropenia

WDR35 gene
Cranioectodermal dysplasia-2
Short rib-polydactyly syndrome type V

WNK1 gene
Hereditary sensory and autonomic neuropathy type IIA
Form of pseudohypoaldosteronism type II

- Jules Berman

key words: rare diseases, allelic heterogeneity, allelic to, polymorphism, gene variation, genetic heterogeneity, genetics, genetics of disease, jules j berman

Thursday, February 4, 2016

A Species is a Biological Entity; Not a Mere Intellectual Abstraction

In the Disney retelling of a classic fairy tale, a human-made abstraction, a puppet named Pinocchio, survives a series of perils and emerges as a real live boy. It seems farfetched that an abstraction could become a living biological organism, but it happens. In point of fact, the transformation of an abstract idea into a living entity is one of the most important scientific advancements of the past half century. For the most part, this miracle of science has gone unheralded. Nonetheless, if you think very deeply about the meaning of classifications, and if you can appreciate the role played by abstractions in the governance of our physical universe, you will appreciate the profound implications of the following story. We shall see that a human-made abstraction that we name "species" has survived a series of perils and has emerged as a real live biological entity.

In the classification of living terrestrial organisms, the bottom classes are known as "species". There is a species class for all the horses and another species class for all the squirrels, and so on. Speculation has it that there are 50 to 100 million different species of organisms on planet earth. We humans have assigned names to a few million species, a small fraction of the total.

It has been argued that nature produces individuals, not species; the concept of species being a mere figment of the human imagination, created for the convenience of taxonomists who need to group similar organisms. Biologists can collect feature data such as gene sequences, geographic habitat, diet, size, mating rituals, hair color, shape of skull and so on, for a variety of different animals. After some analysis, perhaps performed with the aid of a computer, we could cluster animals based on their similarities, and we could assign the clusters names, and the names of our clusters would be our species. The arbitrariness of species creation comes from the various ways we might select the features to be measured in our data sets, the choice of weights assigned to the different features (e.g., should we give more weight to gene sequence than to length of gestation?), and our choice of algorithm for assigning organisms to groups.
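To make the arbitrariness argument concrete, here is a minimal sketch, not taken from any real taxonomic workflow. The three organisms "A", "B", and "C", their two feature scores, and the helper names weighted_distance and nearest are all invented for illustration. The sketch uses a simple nearest-neighbor comparison rather than a full clustering algorithm, and shows that the same organism is paired with a different partner when the feature weights change.

```python
import math

def weighted_distance(a, b, weights):
    """Weighted Euclidean distance between two feature tuples."""
    return math.sqrt(sum(w * (x - y) ** 2 for x, y, w in zip(a, b, weights)))

def nearest(name, animals, weights):
    """Return the name of the organism closest to `name` under the given weights."""
    return min(
        (other for other in animals if other != name),
        key=lambda other: weighted_distance(animals[name], animals[other], weights),
    )

# Hypothetical, normalized feature scores: (gene sequence, length of gestation).
animals = {
    "A": (0.10, 0.90),
    "B": (0.15, 0.20),
    "C": (0.80, 0.85),
}

# Emphasize gene sequence: A is paired with B.
print(nearest("A", animals, (1.0, 0.1)))
# Emphasize gestation: A is paired with C.
print(nearest("A", animals, (0.1, 1.0)))
```

Nothing about the organisms changed between the two calls; only the weights did.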

For myself, and for many other scientists who use classification, there can be no human arbitrariness in the assignment of species (1). A species is a fundamental building block of the natural world, no less substantial than the concept of a galaxy to astronomers or the number "e" to mathematicians.

The modern definition of species is "an evolving gene pool." As such, species have three properties that prove that they are biological entities.

1. Unique definition. Until recently, biologists could not agree on a definition of species. There were dozens of definitions to choose from, depending on which field of science you studied. Molecular biologists defined species by gene sequence. Zoologists defined species by mating exclusivity. Ecologists defined species by habitat constraints. The current definition equating species with an evolving gene pool serves as a great unifying theory for biologists.

2. The class "species" has a biological function that is not available to individual members of the species; namely, speciation. Species propagate, and when they do, they produce new species. Species are the only biological entities that can produce new species.

3. Species evolve. Individuals do not evolve. Evolution requires a gene pool; something that species have and individuals do not.

Species bear a biological relationship to individual organisms. Just as species are defined as evolving gene pools, individual organisms can be defined as a set of propagating genes living within a cellular husk. Hence, the individual organism has a genome taken from the pool of genes available to its species.

The classification of living organisms has worked a true miracle, by breathing life into the concept of species, thus expanding reality.

[1] DeQueiroz K. Ernst Mayr and the modern concept of species. PNAS 102(suppl 1):6600-6607, 2005.

- Jules Berman (copyrighted material)

key words: classification, ontology, species, speciation, jules j berman

Wednesday, February 3, 2016

Unclassifiable objects

Classifications create a class for every object and taxonomies assign each and every object to its correct class. This means that a classification is not permitted to contain unclassified objects; a condition that puts fussy taxonomists in an untenable position. Suppose you have an object, and you simply do not know enough about the object to confidently assign it to a class. Or, suppose you have an object that seems to fit more than one class, and you can't decide which class is the correct class. What do you do? Historically, scientists have resorted to creating a "miscellaneous" class into which otherwise unclassifiable objects are given a temporary home, until more suitable accommodations can be provided. I have spoken with numerous data managers, and everyone seems to be of a mind that "miscellaneous" classes, created as a stopgap measure, serve a useful purpose. Not so. Historically, the promiscuous application of "miscellaneous" classes has proven to be a huge impediment to the advancement of science. In the case of the classification of living organisms, the class of protozoans stands as a case in point. Ernst Haeckel, a leading biological taxonomist in his time, created the Kingdom Protista (i.e., protozoans), in 1866, to accommodate a wide variety of simple organisms with superficial commonalities. Haeckel himself understood that the protists were a blended class that included unrelated organisms, but he believed that further study would resolve the confusion. In a sense, he was right, but the process took much longer than he had anticipated; occupying generations of taxonomists over the following 150 years. Today, Kingdom Protista no longer exists. Its members have been reassigned to various classes of unicellular eukaryotes. Nonetheless, textbooks of microbiology still describe the protozoans, just as though this name continued to occupy a legitimate place among terrestrial organisms.
In the meantime, therapeutic opportunities for eradicating so-called protozoal infections, using class-targeted agents, have no doubt been missed (1). You might think that the creation of a class of living organisms, with no established scientific relation to the real world, was a rare and ancient event in the annals of biology, having little or no chance of being repeated. Not so. A special pseudoclass of fungi, deuteromycetes (spelled with a lowercase "d", signifying its questionable validity as a true biologic class) has been created to hold fungi of indeterminate speciation. At present, there are several thousand such fungi, sitting in a taxonomic limbo, waiting to be placed into a definitive taxonomic class (1), (2).

[1] Berman JJ. Taxonomic Guide to Infectious Diseases: Understanding the Biologic Classes of Pathogenic Organisms. Academic Press, Waltham, 2012.

[2] Guarro J, Gene J, Stchigel AM. Developments in fungal taxonomy. Clinical Microbiology Reviews 12:454-500, 1999.

- Jules Berman (copyrighted material)

key words: classifications, ontology, classes, taxonomy, jules j berman

Tuesday, February 2, 2016

When Reviewing Sets of Data, Always Examine the Range

After you have had a chance to look at the data, it is prudent to determine the highest and the lowest observed values in your data collection (i.e., the range of the data). These two numbers are often the most important numbers in any set of data; even more important than determining the average or the standard deviation. Where the data begins and ends tells the data scientist a great deal about the intrinsic meaning of the data. Moreover, your data must fit within the range of the device that produced the data measurements. Most devices have a range for which they can detect data fairly accurately, the so-called dynamic range (See Glossary item, Accuracy versus precision). Below that range, they might register the measurement as zero, or some fixed minimum value, or as some random value (i.e., noise). Above the range, the instrument might register a fixed maximum value, or some number larger than the maximum (i.e., more noise). Ideally, all of the data elements in your collection will fall well within the dynamic range of the measurement instrument. In any case, it is vital to know the range of the measured data and the dynamic range of the measurement instrument. Data values higher than or lower than the dynamic range do not contain useful information.

It really is not unusual for otherwise intelligent data scientists to develop sophisticated data models for totally spurious measurements that lie outside the dynamic range of their instruments (See Glossary item, Data modeling). Here is an example. You are looking at human subject data that includes weights. You find that the maximum weight in the data set is 300 pounds, exactly. There are many individuals in the data set who have a weight of 300 pounds, but no individuals with a weight exceeding 300 pounds. You also find that the number of individuals weighing 300 pounds is much greater than the number of individuals weighing 290 pounds. What does this tell you? Obviously, the people included in the data set have been weighed on a scale that tops off at 300 pounds. Most of the people whose weight was recorded as 300 will have a false weight measurement. Had we not looked for the maximum value in the data set, we would have assumed, incorrectly, that the weights were valid (1).
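The check described above can be sketched in a few lines of Python. This is only an illustration: the function name ceiling_pileup, the threshold ratio of 3, and the sample weights are all my own assumptions, and a real data set would call for a threshold chosen with some thought.

```python
from collections import Counter

def ceiling_pileup(values, ratio=3.0):
    """Return (minimum, maximum, suspicious).

    `suspicious` is True when the maximum value occurs at least `ratio`
    times as often as the next-highest distinct value, a hint that the
    measuring instrument topped off at that maximum.
    """
    counts = Counter(values)
    lowest, highest = min(counts), max(counts)
    below = [v for v in counts if v < highest]
    if not below:
        # Only one distinct value; nothing to compare against.
        return lowest, highest, False
    next_highest = max(below)
    suspicious = counts[highest] >= ratio * counts[next_highest]
    return lowest, highest, suspicious

# Made-up weights, in pounds, with a pile-up at exactly 300.
weights = [150, 180, 210, 250, 290, 290, 300, 300, 300, 300, 300, 300, 300]
print(ceiling_pileup(weights))  # (150, 300, True)
```

A True flag does not prove that the scale topped off; it only tells you where to look before trusting the values at the top of the range.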

It might be useful to get some idea of how weights are distributed in the population exceeding 300 pounds (i.e., the population outside the dynamic range of the scale). One way of estimating the error is to look at the number of people weighing 295 pounds, 290 pounds, 285 pounds, etc. By observing the trend, and knowing the total number of individuals who weigh at least 300 pounds, you can estimate the number of people falling into the weight categories exceeding 300 pounds.
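Here is one hedged way to turn that trend into numbers. The sketch assumes, purely for illustration, that the counts fall off by a constant ratio from one 5-pound bin to the next (a geometric decay), and that the same ratio continues above 300 pounds. The function name estimate_tail and the sample counts are hypothetical.

```python
def estimate_tail(bin_counts, n_capped, n_bins=5):
    """Spread the capped individuals across bins above the scale's maximum.

    bin_counts: counts in consecutive equal-width bins just below the cap,
                in ascending weight order (e.g., 285-289, 290-294, 295-299).
    n_capped:   number of individuals recorded at the cap (300 pounds).
    Returns the estimated counts for the first `n_bins` bins above the cap.
    """
    # Average bin-to-bin ratio observed just below the cap.
    ratios = [b / a for a, b in zip(bin_counts, bin_counts[1:])]
    r = sum(ratios) / len(ratios)
    if not 0 < r < 1:
        raise ValueError("counts must be decreasing toward the cap")
    # Terms of a geometric series whose infinite sum equals n_capped.
    return [n_capped * (1 - r) * r ** k for k in range(n_bins)]

# Made-up counts: 80, 40, and 20 people in the three bins below 300 pounds,
# and 32 people recorded at exactly 300 pounds.
print(estimate_tail([80, 40, 20], 32))  # [16.0, 8.0, 4.0, 2.0, 1.0]
```

The five bins shown account for 31 of the 32 capped individuals; the remainder falls in bins farther out. If the real distribution does not decay geometrically, the estimate will be off accordingly.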

Here is another example where knowing the maxima for a data set measurement is useful. You are looking at a collection of data on meteorites. The measurements include weights. You notice that the largest meteorite in the large collection weighs 66 tons (equivalent to about 60,000 kilograms), and has a diameter of about 3 meters. Small meteorites are more numerous than large meteorites, but almost every weight category is accounted for by one or more meteorites, up to 66 tons. After that, nothing. You check the published data on meteorites and find that none of your colleagues have reported finding meteorites weighing in excess of about 66 tons. Why do meteorites have a maximum size of about 66 tons (See Glossary items, Meta-analysis, Missing values)?

A little checking tells you that meteors in space can come in just about any size, from a speck of dust to a moon-sized rock. Collisions with earth have involved meteorites much larger than 3 meters. You check the astronomical records and you find that the meteor that may have caused the extinction of large dinosaurs about 65 million years ago was estimated at 6 to 10 kilometers in diameter (at least 2000 times the diameter of the largest meteorite found on earth).

There is a very simple reason why the largest meteorite found on earth weighs about 66 tons, while the largest meteorites to impact the earth are known to be thousands of times heavier. When meteorites exceed 66 tons, the impact energy can exceed the energy produced by an atom bomb blast. Meteorites larger than 66 tons leave an impact crater, but the meteor itself disintegrates on impact (1).

As it turns out, much is known about meteorite impacts. The kinetic energy of the impact is determined by the mass of the meteor and the square of the velocity. The minimum velocity of a meteor at impact is about 11 km/second (equivalent to the minimum escape velocity for sending an object from earth into space). The fastest impacts occur at about 70 km per second. From this data, the energy released by meteors, on impact with the earth, can be easily calculated.
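The calculation is short enough to sketch in Python. The mass (about 60,000 kilograms) and the two velocities come from the text above. The function name is hypothetical, the TNT conversion factor is the conventional 4.184e12 joules per kiloton, and the sketch ignores any slowing of the meteor by the atmosphere.

```python
TNT_KILOTON_J = 4.184e12  # joules per kiloton of TNT (conventional definition)

def impact_energy_joules(mass_kg, velocity_km_s):
    """Kinetic energy E = 1/2 * m * v^2, with v converted to meters per second."""
    v = velocity_km_s * 1000.0
    return 0.5 * mass_kg * v ** 2

mass_kg = 60_000  # roughly 66 tons, the largest meteorite in the collection

for velocity in (11, 70):  # slowest and fastest impacts, in km/second
    energy = impact_energy_joules(mass_kg, velocity)
    print(f"{velocity} km/s: {energy:.3g} J, "
          f"about {energy / TNT_KILOTON_J:.2g} kilotons of TNT")
```

Under these assumptions, a 66-ton meteor arriving at the slowest speed carries a bit under one kiloton of TNT equivalent, and at the fastest speed it carries tens of kilotons, which is the scale of the early atomic bombs.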

By observing the maximum weight of meteors found on earth we learn a great deal about meteoric impacts. When we look at the distribution of weights, we can see that small meteorites are more numerous than larger meteorites. If we develop a simple formula that relates the size of a meteorite with its frequency of occurrence, we can predict the likelihood of the arrival of a meteorite on earth, for every weight of meteorite, including those weighing more than 66 tons, over any interval of time.
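As a sketch of what such a "simple formula" might look like, the Python below fits a power law (count proportional to mass raised to a negative exponent) by least squares on the logarithms, then extrapolates to a mass beyond the largest recovered meteorite. The power-law form is my assumption, the function names are hypothetical, and the counts are invented for illustration; they are not real meteorite data.

```python
import math

def fit_power_law(masses, counts):
    """Least-squares fit of log10(count) = a - b * log10(mass).
    Returns (a, b)."""
    xs = [math.log10(m) for m in masses]
    ys = [math.log10(c) for c in counts]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    a = mean_y - slope * mean_x
    return a, -slope

def predicted_count(a, b, mass):
    """Expected number of meteorites of the given mass, per the fitted law."""
    return 10 ** (a - b * math.log10(mass))

# Invented counts per weight category (kg) over some fixed time interval.
masses = [1, 10, 100, 1000]
counts = [1000, 100, 10, 1]

a, b = fit_power_law(masses, counts)
# Extrapolate to a mass at the observed ceiling of about 60,000 kg.
print(predicted_count(a, b, 60_000))
```

A predicted count well below one per interval translates into a long expected wait between arrivals, which is how a fit to small, recoverable meteorites can say something about the large ones that never survive impact.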

[1] Berman JJ. Principles of Big Data: Preparing, Sharing, and Analyzing Complex Information. Morgan Kaufmann, Waltham, MA, 2013.

- Jules Berman (copyrighted material)

key words: range, dynamic range, maxima, minima, maximum, minimum, data analysis, data science, data simplification, jules j berman

Monday, February 1, 2016

When to terminate (or at least reconsider) a data repurposing project

"Not everything that counts can be counted, and not everything that can be counted counts." - William Bruce Cameron

The most valuable features of data worth repurposing are:

1. Data that establishes uniqueness or identity
2. Data that accrues over time, documenting the moments when data objects are obtained (i.e., time-stamped data)
3. Data that establishes membership in a defined group or class
4. Data that is classified, for every object in a knowledge domain
5. Introspective data - data that explains itself

A different set of properties characterizes data sets that are virtually useless for data repurposing projects.

1. Data sets that are incomplete or unrepresentative of the subject domain. You cannot draw valid conclusions, if the data you are analyzing is unrepresentative of the data domain under study.

Having a large set of data does not guarantee that your data is complete and representative. Danah Boyd, a social media researcher, gives the example of a scientist who is analyzing the complete set of tweets made available by Twitter (1). If Twitter removes tweets containing expletives, or tweets composed of non-word character strings, or tweets containing highly charged words, or tweets containing certain types of private information, then the resulting data set, no matter how large it may be, is not representative of the population of senders (See Glossary item, Privacy versus confidentiality). If the tweets are available as a set of messages, without any identifier for senders, then the compulsive tweeters (those who send hundreds or thousands of tweets) will be over-represented, and the one-time tweeters will be under-represented. If each tweet were associated with an account, and all the tweets from a single account were collected as a unique record, then there would still be the problem created by tweeters who maintain multiple accounts (See Glossary item, Representation bias).
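The over-representation problem is easy to see in a toy example. The Python sketch below assumes each tweet arrives as an (account, text) pair; the account names, the function names, and the data are all hypothetical. It measures how much of the collection comes from the busiest account, then collapses the tweets into one record per account.

```python
from collections import Counter

def share_of_top_senders(tweets, top_n=1):
    """Fraction of all tweets sent by the top_n most active accounts."""
    per_account = Counter(account for account, _ in tweets)
    top = per_account.most_common(top_n)
    return sum(count for _, count in top) / len(tweets)

def one_record_per_account(tweets):
    """Collect every tweet from an account into a single record."""
    records = {}
    for account, text in tweets:
        records.setdefault(account, []).append(text)
    return records

# Invented data: one compulsive tweeter and two one-time tweeters.
tweets = [("acct_a", f"tweet {i}") for i in range(8)]
tweets += [("acct_b", "hello"), ("acct_c", "hi")]

print(share_of_top_senders(tweets))         # share of messages from the busiest account
print(len(one_record_per_account(tweets)))  # number of distinct accounts
```

Grouping by account removes the message-count distortion, but it does nothing about one person holding several accounts; that requires information the tweets themselves do not carry.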

Contrariwise, having a small amount of data is not necessarily fatal for data repurposing projects. If the data at hand cannot support your intended analysis, it may be sufficient to answer an alternate set of questions, particularly if the data indicate large effects and achieve statistical significance. In addition, small data sets can be merged with other small or large data sets to produce representative and complete aggregate data collections.

2. Data that lacks metadata. It may seem a surprise to some, but most of the data collected in the world today is poorly annotated. There is no way to determine how the data elements were obtained, or what they mean, and there is no way of verifying the quality of the data.

3. Data without unique identifiers. If there is no way to distinguish data objects, then it is impossible to distinguish 10 data values that apply to one object from 10 data values that apply to 10 different objects.

The term "identified data," a concept that is central to data science, must be distinguished from "data that is linked to an identified individual," a concept that has legal and ethical importance. In the privacy realm, the term, "data that is linked to an identified individual," is shortened to "identified data," and this indulgence has caused no end of confusion. All good data must be identified. Private data can be deidentified, in the regulatory sense, by removing any links between the data and the person to whom the data applies (See Glossary items, Deidentification, Deidentification versus anonymization, Reidentification). The data itself should never be deidentified (i.e., a unique alphanumeric identifier for every data object must exist). Removing links that connect the data object to an individual is all that is necessary for so-called privacy deidentification.
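The distinction can be sketched in a few lines of Python. The field names ("name", "ssn", "object_id", "glucose") and the function name are hypothetical, and the sketch illustrates only the point made above: the links to a person are removed, while every record keeps (or is given) a unique identifier of its own. Real privacy deidentification involves much more than dropping a couple of fields.

```python
import uuid

def deidentify(records, link_fields=("name", "ssn")):
    """Return copies of the records with person-linking fields removed.
    Each record keeps its unique object identifier, or gets a new one
    if it never had one."""
    cleaned = []
    for record in records:
        copy = {k: v for k, v in record.items() if k not in link_fields}
        # The data object itself must stay identified.
        copy.setdefault("object_id", uuid.uuid4().hex)
        cleaned.append(copy)
    return cleaned

# Invented records: one already identified, one lacking an identifier.
records = [
    {"object_id": "a1", "name": "Jane Doe", "glucose": 85},
    {"name": "John Roe", "glucose": 110},
]

clean = deidentify(records)
for record in clean:
    print(record)
```

After the call, no record names a person, yet two glucose values can still be told apart as belonging to two different data objects.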

4. Undocumented data (e.g., data with no known creator, or no known owner, or with no "rights" statement indicating who may use the data and for what purposes). Data scientists cannot assume that they can legally use every data set that they acquire.

5. Illegal data or legally encumbered data or unethical data. Data scientists cannot assume that they have no legal liability when they use data that was appropriated unlawfully.

Data quality is serious business. The U.S. government passed the Data Quality Act in 2001, as part of the FY 2001 Consolidated Appropriations Act (Pub. L. No. 106-554). The Act requires Federal Agencies to base their policy decisions on high quality data and to permit the public to challenge and correct inaccurate data (2), (3). The drawback to this legislation is that science is a messy process, and data may not always attain a high quality. Data that fails to meet standards of quality may be rejected by government committees or may be used to abrogate policies that were based on the data (4), (5).


[1] Boyd D. 2010. "Privacy and publicity in the context of big data." Open Government and the World Wide Web (WWW2010). Raleigh, North Carolina, April 29, 2010. Available from:, viewed August 26, 2012.

[2] Data Quality Act. 67 Fed. Reg. 8,452, February 22, 2002, addition to FY 2001 Consolidated Appropriations Act (Pub. L. No. 106-554. codified at 44 U.S.C. 3516).

[3] Guidelines for ensuring and maximizing the quality, objectivity, utility, and integrity of information disseminated by federal agencies. Federal Register Vol. 67, No. 36, February 22, 2002.

[4] Sass JB, Devine JP Jr. The Center for Regulatory Effectiveness invokes the Data Quality Act to reject published studies on atrazine toxicity. Environ Health Perspect 112:A18, 2004.

[5] Tozzi JJ, Kelly WG Jr, Slaughter S. Correspondence: data quality act: response from the Center for Regulatory Effectiveness. Environ Health Perspect 112:A18-19, 2004.

- Jules Berman (copyrighted material)

key words: data science, data repurposing, data reanalysis, data analysis, primary data, secondary data, data quality act, jules j berman

Sunday, January 31, 2016

Decoding Mayan glyphs: using data science to discover a lost civilization

"It is an amazement, how the voice of a person long dead can speak to you off a page as a living presence." - Garrison Keillor

On the Yucatan peninsula, concentrated within a geographic area that today encompasses the southeastern tip of Mexico, plus Belize, and Guatemala, a great civilization flourished. The Mayan civilization seems to have begun about 2000 BCE, reaching its peak in the so-called classic period (250 - 900 AD). Abruptly, about 900 AD, the great Mayan cities were abandoned, and the Mayan civilization entered a period of decline. Soon after the Spanish colonization of the peninsula, in the 16th century, the Mayans were subjected to a deliberate effort to erase any trace of their heritage. The desecration of the Mayans was led by a Spanish priest named Diego de Landa Calderon (1524-1579). Landa's acts against Mayan culture included:

1. The destruction of all Mayan books and literature (only a few books survived immolation).

2. The conversion of Mayans to Catholicism, in which school-children were forced to learn Roman script and Arabic numerals.

3. The importation of the Spanish Inquisition, accounting for the deaths of many Mayans who preferred their own culture over that of Landa's.

By the dawn of the 20th century, the great achievements of the Mayan civilization were forgotten, its cities and temples were thoroughly overgrown by jungle, its books had been destroyed, and no humans on the planet could decipher the enduring stone glyph tablets strewn through the Yucatan peninsula.

In the late twentieth century, culminating several centuries of effort by generations of archeologists and epigraphers, the Maya glyphs were successfully decoded. The successful decoding of the Mayan glyphs and the discovery of the history and achievements of the Mayan civilization, during its classic period, is, perhaps, the most exciting legacy data project ever undertaken. The story of the resurrection and translation of the Mayan glyphs leaves us with many lessons that apply to modern-day data repurposing projects.

Maya stucco glyphs displayed in the museum at Palenque, Mexico.
Image source: Wikipedia, public domain, from,
where there is an excellent discussion of Mayan script
Lesson 1. Success follows multiple breakthroughs, sometimes occurring over great lengths of time.

The timetable for the Mayan glyph project extends over more than four centuries.

1566 - Landa, the same man largely responsible for the destruction of the Mayan culture and language, wrote a manuscript in which he attempted to record a one-to-one correspondence between the Roman alphabet and the Mayan alphabet, with the help of local Mayans. Landa had assumed that the Mayan language was alphabetic, like the Spanish language. As it happens, the Mayan language is logophonetic, with some symbols corresponding to syllables and other symbols corresponding to words and concepts. For centuries, the so-called Mayan alphabet only added to the general confusion. Eventually, Landa's notes were used, with the few surviving Mayan codices, to crack the Mayan code.

1832 - Constantine Rafinesque decoded the Mayan number system.

1880 - Forstemann, working from an office in Dresden, Germany, had access to the Dresden Codex, one of the few surviving Mayan manuscripts. Using Rafinesque's techniques to decode the numbers that appeared in the Dresden Codex, Forstemann deduced how the Mayans recorded the passage of time, and how they used numbers to predict astronomic events, with great accuracy.

1952 - Yuri Knorozov, working alone in Russia, deduced how individual glyph symbols were used as syllables.

1958 - Tatiana Proskouriakoff, using Knorozov's syllabic approach to glyph interpretation, convincingly made the first short translations from stelae (i.e., standing stone monuments), and proved that they told the life stories of Mayan kings.

1973 - Thirty Mayanists from various scientific disciplines convened at Palenque, a Mayan site, and, through a team effort, deciphered the dynastic history of six kings.

1981 - David Stuart showed that different pictorial symbols could represent the same syllable, so long as the beginning sound of the word represented by the symbol was the same as the beginning sound of the other syllable-equivalent words. This would be analogous to a picture of a ball, a balance, and a banner, all serving as interchangeable forms of the sound "ba".

Following Stuart's 1981 breakthrough, the Mayan code was essentially broken.

Lesson 2. Contributions come from individuals working in isolation and individuals working as a team.

"My feeling is that as far as creativity is concerned, isolation is required." - Isaac Asimov (1)

As social animals, we tend to believe in the supremacy of teamwork. We often marginalize the contributions of individuals who work in isolation. Objective review of most large, successful projects reveals that important contributions come from individuals working in isolation, plus teams, working to accomplish goals that could not be achieved through the efforts of an individual. The task of decoding the Mayan glyphs was assisted by two key individuals, each working in isolation, thousands of miles from Mexico: Ernst Forstemann, in Germany, and Yuri Knorozov, in Moscow. It is difficult to imagine how the Mayan project could have succeeded without the contributions of these two loners. The remainder of the project was accomplished within a community of scientists who cleared the long-forgotten Mayan cities, recovered glyphs, compared the findings at the different sites, and eventually reconstructed the language. Throughout this book, we will examine legacy projects that succeeded due to the combined efforts of teams and of isolated individuals.

Lesson 3. Project contributors come from many different disciplines.

The team of 30 experts convening in Palenque in 1973 was composed of archeologists, epigraphers, linguists, anthropologists, historians, astronomers, and ecologists.

Lesson 4. Progress was delayed due to influential naysayers.

After the Mayan numbering system had been decoded, and after it was shown that the Mayans were careful recorders of time, and astronomic events, linguists turned their attention to the fascinating legacy of the non-numeric glyphs. Try as they might, Mayanists of the mid-20th century could make no sense of the non-numeric symbols. Eric Thompson (1898 - 1975) stood as the premier Mayanist authority from the 1930s through the 1960s. After trying, and failing, to decipher the non-numeric glyphs, he concluded that these glyphs represented mystic, ornate symbols; not language. The non-numeric glyphs, in his opinion, could not be deciphered because they had no linguistic meaning. Thompson was venerated to such an extent that, throughout his long tenure of influence, all progress in the area of glyph translation was suspended. When Thompson's influence finally waned, a new group of Mayanists came forward to crack the code.

Lesson 5. Ancient legacy data conformed to modern annotation practices.

The original data had a set of properties that were conducive to repurposing: unique, identified objects (e.g., name of king, and name of city), with a time-stamp on all entries (implying the existence of a sophisticated calendar and time-keeping methods). The data was encoded in a sophisticated number system that included the concept of zero, and was annotated with metadata (i.e., descriptions of the quantitative data; see Glossary item, Metadata).

Lesson 6. Legacy data is often highly accurate data.

Old data is often accurate data, if it is recorded at the time and place that events transpired. Records of crops, numbers of sacrifices, numbers of slaves traded, are the most objective data that we are likely to encounter. In the particular example of the astronomical data included in the Dresden Codex, Mayan astronomers accurately predicted eclipses, measuring decade-long intervals within an accuracy of several minutes.

Lesson 7. Legacy data is necessary for following trends.

There is a tendency to be dismissive of archeologic data, due to the superabundance of more recently acquired data (See Glossary item, Data archeology). A practical way to think about the value of archeological data is that if the total amount of historical data is relatively small, the absolute value of each piece of such data is high. For example, 1990 records on temperature and precipitation may not exhibit the level of detail contained in present-day meteorological files, but the 1990 files may represent the only reliable source of climate data for the era, and it may be impossible to predict long-term climate trends without historical data. Without the availability of old data to establish baseline measurements and trends, the analysis of new data is impeded. Hence, every bit of old data has amplified importance for today's data scientists. The classic empire of the Mayans came to an abrupt ending, about 900 AD. We do not understand the reason for the collapse of Mayan civilization, but untapped clues residing in the Mayan glyphs may reveal disturbing ancient trends that presage a future catastrophe.

Lesson 8. Data worth recording is data worth saving.

Landa destroyed the Mayan libraries in 1562. The few remaining literary works of the ancient Mayans can be translated, but the vast bulk of Mayan literature is a lost legacy. Any one of those disparaged books would be a priceless treasure today.

Book burnings are a time-honored tradition enjoyed the world over by religious zealots (2). Some of the greatest books in history have been burned to a crisp. The first recorded, but least successful, book burning in history occurred around 612 B.C. and involved the library of Ashurbanipal (668 - 627 B.C.), king of the neo-Assyrian empire. Among the texts contained in the library was the Gilgamesh epic, written in about 2500 B.C. Marauders set fire to the palace and the library, with limited effect. Many of the greatest works were written on cuneiform tablets. The fire baked the clay tablets, preserving them to the present day. The Library of Alexandria was the most famous library of the ancient world. As a repository of truth and knowledge, it was a popular target. At least four major assaults punctuated the library's incendiary past: Julius Caesar in the Alexandrian War (48 B.C.), Aurelian's Palmyrene campaign (273 A.D.), the decree of Theophilus (391 A.D.) and the Muslim conquest (642 A.D.). We do not know the number of books held in the Library, but when the Alexandria library was sacked, the books provided sufficient fuel to heat the Roman baths for six months. Book burning never goes out of style. As recently as 1992, during the siege of Sarajevo, the National Library was enthusiastically burned to the ground. Thousands of irreplaceable books were destroyed in the literary equivalent of genocide.

[1] Asimov I. Isaac Asimov Mulls "How Do People Get New Ideas?" MIT Technology Review October 20, 2014.

[2] Berman JJ. Machiavelli's Laboratory. Amazon Digital Services, Inc., 2010.

- Jules Berman (copyrighted material)

key words: mayans, maya, data science, data repurposing, data reanalysis, cryptography, cryptology, data analysis, decoding, legacy data, old data, data archeology, jules j berman

Saturday, January 30, 2016


"The first lesson of Web-scale learning is to use available large-scale data rather than hoping for annotated data that isn't available." - Peter Norvig, Alon Halevy, and Fernando Pereira (1)

Despite the preponderance of old data, most data scientists devote their efforts to newly acquired data or to nonexistent data that may emerge in the unknowable future. Why does old data get such little respect? The reasons are manifold.

1. Much of old data is proprietary and cannot be accessed by anyone other than its owners.

2. The owners of proprietary data, in many cases, are barely aware of the contents, or even the existence of their own data, and have no understanding of the value of their holdings, to themselves or to others.

3. Old data is typically stored in formats that are inscrutable to young data scientists. The technical expertise required to use the data intelligibly is unavailable.

4. Much of old data lacks proper annotation. There simply is not sufficient information about the data (e.g., how it was collected and what the data means) to support useful analysis.

5. Much of old data, annotated or not, has not been indexed in any serious way. There is no easy method of searching the contents of old data.

6. Much of old data is poor data, collected without the kinds of quality assurances that would be required to support any useful analysis of its contents.

7. Old data is orphaned data. When data has no guardianship, the tendency is to ignore the data or to greatly underestimate its value.

The sheer messiness of old data is conveyed by the gritty jargon that permeates the field of data repurposing (Data cleaning, Data mining, Data munging, Data scraping, Data scrubbing, Data wrangling). Anything that requires munging, scraping, and scrubbing can't be too clean.

Data sources are referred to as "old" or "legacy"; neither term calls to mind vitality or robustness. A helpful way of thinking about the subject is to recognize that new data is just updated old data. New data (See Glossary item, New Data, below), without old data, cannot be used for the purpose of seeing long-term trends.

Nobody seems to put enough value on legacy data. Nobody seems to want to pay for legacy data and nobody seems to invest in preserving legacy data. The stalwart data scientist must not be discouraged. As I'll show in future blogs, preserving old data is definitely worth the bother.


New data - It is natural to think of certain objects as being "new", meaning, with no prior existence; and other objects being "old", having persisted from an earlier time, and into the present. In truth, there are very few "new" objects in our universe. Most objects arise in a continuum, through a transformation or a modification of an old object. For example, embryos are simply cellular growths that develop from pre-existing gonocytes, and the development of an embryo into a newborn organism, that is not really new at all, follows an ancient path written by combined fragments of pre-existing DNA sequences. When we speak of "new" data, alternately known as prospectively acquired data or as prospective data, we must think in terms that relate the new data to the "old" data that preceded it. For example, the air temperature one minute from now is largely determined by weather events that are occurring now, and the weather occurring now is largely determined by all of the weather events that have occurred in the history of our planet. Data scientists have a pithy aphorism that captures the entangled relationship between "new" and "old" data: "Every prospective study becomes a retrospective study on day two".


[1] Norvig P, Halevy A, Pereira F. The Unreasonable Effectiveness of Data. IEEE Intelligent Systems 24:8-12, 2009.

- Jules Berman (copyrighted material)

key words: data repurposing, data science, using data, data simplification, legacy data, jules j berman

Friday, January 29, 2016

Misinterpretation of Results: The Most Pervasive Error in Data Science

The most common scientific errors are post-analytic, arising from the interpretation of data (1), (2), (3), (4), (5), (6). Pre-analytic errors and analytic errors, though common, are much less frequently encountered than interpretation errors. Virtually every journal article contains, hidden in the introduction and discussion sections, some distortion of fact or misleading assertion. Scientists cannot be objective about their own work. As humans, we tend to interpret observations to reinforce our beliefs and prejudices and to advance our agendas.

One of the most common strategies whereby scientists distort their own results is to contrive self-serving conclusions; a process called message framing (7). In message framing, a scientist draws his or her preferred conclusion, omitting from the discussion any pertinent findings that might diminish or discredit that conclusion. The common practice of message framing is conducted on a subconscious, or at least a sub-rational, level. A scientist is not apt to read articles whose conclusions contradict his own hypotheses and will not cite disputatious works. Furthermore, if a paradigm is held in high esteem by a majority of the scientists in a field, then works that contradict the paradigm are not likely to pass peer review. Hence, it is difficult for contrary articles to be published in scientific journals. In any case, the message delivered in a journal article is almost always framed in a manner that promotes the author's interpretation.

It must be noted that throughout human history, no scientist has ever gotten into any serious trouble for misinterpreting results. Scientific misconduct comes, as a rule, from the purposeful production of bad data, either through falsification, fabrication, or through the refusal to remove and retract data that is known to be false, plagiarized, or otherwise invalid. In the U.S., allegations of research misconduct are investigated by the Office of Research Integrity (ORI). Funding agencies in other countries have similar watchdog institutions. The ORI makes its findings a matter of public record (8). Of 150 cases investigated between 1993 and 1997, all but one case had an alleged component of data falsification, fabrication or plagiarism (9). In 2007, of the 28 investigated cases, 100% involved allegations of falsification, fabrication, or both (10). No cases of misconduct based on data misinterpretation were prosecuted (11).

Post-analytic misinterpretation of data is hard-wired into the human psyche. Agencies tasked with ensuring scientific integrity have never seriously confronted the problem of data misinterpretation. Why would they? You can't fight human nature.

In December 2010, amidst much fanfare, NASA scientists announced that a new form of life was found on earth, a microorganism that thrived in the high concentrations of arsenic prevalent in Mono Lake, California. The microorganism was shown to incorporate arsenic into its DNA, instead of the phosphorus used by all other known terrestrial organisms. Thus, the newfound organism synthesized a previously unknown type of genetic material (12). NASA's associate administrator for the Science Mission Directorate, at the time, wrote, "The definition of life has just expanded." (13) The Director of the NASA Astrobiology Institute at the agency's Ames Research Center in Moffett Field, California, wrote "Until now a life form using arsenic as a building block was only theoretical, but now we know such life exists in Mono Lake." (13)

Heady stuff! Soon thereafter, other scientists tried but failed to confirm the earlier findings (14). It seems that the new life form was just another old life form, and the arsenic was a hard-to-wash cellular contaminant (11). The best scientists on the planet cannot resist the lure of a scientific interpretation that promotes their own agenda.

The first analysis of data is usually wrong and irreproducible. Erroneous results and misleading conclusions are regularly published by some of the finest laboratories in the most prestigious institutions in the world (15), (16), (17), (18), (19), (20), (21), (22), (23), (24), (25), (26), (27). Every scientific study must be verified and validated, and the most effective way to ensure that verification and validation take place is to release your data for public review.


[1] Ioannidis JP. Is molecular profiling ready for use in clinical decision making? The Oncologist 12:301-311, 2007.

[2] Ioannidis JP. Why most published research findings are false. PLoS Med 2:e124, 2005.

[3] Ioannidis JP. Some main problems eroding the credibility and relevance of randomized trials. Bull NYU Hosp Jt Dis 66:135-139, 2008.

[4] Ioannidis JP. Microarrays and molecular research: noise discovery? The Lancet 365:454-455, 2005.

[5] Ioannidis JP, Panagiotou OA. Comparison of effect sizes associated with biomarkers reported in highly cited individual articles and in subsequent meta-analyses. JAMA 305:2200-2210, 2011.

[6] Berman JJ. Principles of Big Data: Preparing, Sharing, and Analyzing Complex Information. Morgan Kaufmann, Waltham, MA, 2013.

[7] Wilson JR. Rhetorical Strategies Used in the Reporting of Implantable Defibrillator Primary Prevention Trials. Am J Cardiol 107:1806-1811, 2011.

[8] Office of Research Integrity. Available from:

[9] Scientific Misconduct Investigations. 1993-1997. Office of Research Integrity, Office of Public Health and Science, Department of Health and Human Services, December, 1998.

[10] Office of Research Integrity Annual Report 2007, June 2008. Available from:, viewed Jan. 1, 2015.

[11] Berman JJ. Repurposing Legacy Data: Innovative Case Studies. Morgan Kaufmann, Waltham, MA, 2015.

[12] Wolfe-Simon F, Switzer Blum J, Kulp TR, Gordon GW, Hoeft SE, Pett-Ridge J, et al. A Bacterium That Can Grow by Using Arsenic Instead of Phosphorus. Science 332:1163-1166, 2011.

[13] Discovery of "Arsenic-bug" Expands Definition of Life. NASA December 2, 2010.

[14] Reaves ML, Sinha S, Rabinowitz JD, Kruglyak L, Redfield RJ. Absence of arsenate in DNA from arsenate-grown GFAJ-1 cells. Science 337:470-473, 2012.

[15] Knight, J. Agony for researchers as mix-up forces retraction of ecstasy study. Nature 425:109, September 11, 2003.

[16] Hwang WS, Roh SI, Lee BC, Kang SK, Kwon DK, Kim S, et al. Patient-specific embryonic stem cells derived from human SCNT blastocysts. Science 308:1777-1783, 2005.

[17] Hajra A, Collins FS. Structure of the leukemia-associated human CBFB gene. Genomics 26:571-579, 1995.

[18] Altman LK. Falsified data found in gene studies. The New York Times October 30, 1996.

[19] Findings of scientific misconduct. NIH Guide Volume 26, Number 23, July 18, 1997 Available from:

[20] Bren L. Human Research Reinstated at Johns Hopkins, With Conditions. U.S. Food and Drug Administration, FDA Consumer magazine, September-October, 2001.

[21] Kolata G. Johns Hopkins Admits Fault in Fatal Experiment. The New York Times July 17, 2001.

[22] Brooks D. The Chosen: Getting in. The New York Times, November 6, 2005.

[23] Seward Z. MIT Admissions dean resigns; admits misleading school on credentials degrees from three colleges were fabricated, MIT says. Harvard Crimson, April 26, 2007.

[24] Salmon A, Hawkes N. Clone 'hero' resigns after scandal over donor eggs. The Times, November 25, 2005.

[25] Wilson D. Harvard Medical School in Ethics Quandary. The New York Times March 3, 2009.

[26] Findings of Scientific Misconduct. NOT-OD-05-009. November 22, 2004. Available from:

[27] Hajra A, Liu PP, Wang Q, Kelley CA, Stacy T, Adelstein RS, et al. The leukemic core binding factor -smooth muscle myosin heavy chain (CBF-SMMHC) chimeric protein requires both CBF and myosin heavy chain domains for transformation of NIH 3T3 cells. Proc Natl Acad Sci USA 92:1926-1930, 1995.

- Jules Berman (copyrighted material)

key words: data analysis, data science, misinterpretation of results, distorting results, result bias, author bias, paradigm bias, data interpretation, jules j berman

Thursday, January 28, 2016

Rare Diseases: High Priority in Precision Medicine

It is very difficult to steer medical scientists away from their belief that common diseases are more important than rare diseases. Too often, scientists are persuaded by the observation that a few dozen common diseases account for the vast majority of the morbidity and mortality suffered by humans. Hence, a breakthrough in treating any of the common diseases will benefit many more people than an advance in any of the rare diseases. The reasoning seems flawless, but research targeted at the common diseases has been disappointing. In the past 50 years, most of the major advances in medicine have involved the rare diseases. Advances in the common diseases have come about as a consequence of discoveries made on rare diseases.

As it happens, the rare diseases are much easier to understand and treat than the common diseases. If we waited for medical scientists to cure the common diseases, we would miss our currently available opportunity to cure diseases, either rare or common.

Rule - Rare diseases are easier to treat than common diseases.

Brief Rationale - Rare diseases have simple genetic defects, have little heterogeneity, and have few metabolic options with which they can evade targeted treatments.

Ryanodine receptor 2 mutations are responsible for several rare arrhythmia syndromes in humans (e.g., forms of catecholaminergic polymorphic ventricular tachycardia and arrhythmogenic right ventricular dysplasia). Individuals with these disorders can be treated with drugs that stabilize the receptor. Damage to ryanodine receptor 2 seems to occur as a component of common heart failure, leading to calcium leak and arrhythmia. Preliminary studies indicate that drugs that stabilize the receptor may ameliorate all types of heart failure and the lethal arrhythmias that ensue (2). Thus, our deep understanding of a rare disease has led us to a general understanding of a common disease.

Alexion is a pharmaceutical company that specializes in developing drugs intended to treat rare diseases. For example, Alexion discovered and developed eculizumab (trade name Soliris), a first-in-class terminal complement inhibitor. Eculizumab was approved by the FDA in 2007 for the treatment of paroxysmal nocturnal hemoglobinuria, and in 2011 for the treatment of atypical hemolytic uremic syndrome. Subsequently, eculizumab was tested for effectiveness against several common diseases. Eculizumab was a candidate treatment for so-called dry age-related macular degeneration, a common disease, though it was not shown to be effective (3). On the brighter side, eculizumab has been shown to prevent acute and chronic rejection in certain subsets of patients who received renal transplants (4). When you have a drug that is known to target a particular member of an active physiologic pathway, it is likely to have some benefit in one or more common diseases whose clinical phenotype is due, in part, to aberrations of the same pathway.

Rule - Drugs that are safe and effective against rare diseases will be used in the treatment of one or more common diseases.

Brief Rationale - The rare diseases, as an aggregate group, comprise every possible pathogenic pathway available to cells. Hence, pathogenic pathways that are active in the common diseases will be active in one or more rare diseases. Agents that target pathways in the rare diseases are candidate treatments for the common diseases with which they share active pathways.

Wrinkled skin is one of the most common physical conditions. Every man and woman who lives long enough will wrinkle a bit. For some individuals, wrinkling is a problem that merits medical attention. Botox (botulinum toxin) is the drug du jour for treating wrinkles. Botox is also one of the most powerful poisons known. How did it come about that Botox emerged as a popular wrinkle treatment? Botox was originally developed, tested, and approved to treat several rare diseases characterized by uncontrolled blinking. After approval was awarded, Botox was found to be extremely effective for rare spasmodic conditions, including spasmodic torticollis (i.e., wry neck). In the course of treating rare diseases, it was noticed that Botox injections could temporarily erase wrinkles. The rest is history. The Botox story exemplifies how an effective treatment developed for a rare disease can gain popularity as a treatment for a common condition.

Rule - It is much more useful to treat a disease pathway than it is to treat the individual gene mutation or its expressed protein.

Brief Rationale - Many different diseases may respond to a drug that targets a pathogenic pathway, while only one genetic variant of one rare disease is likely to respond to a drug that targets the disease-causing gene or its expressed protein.

There is a very important lesson to be learned: Treat the pathway, not the gene. This lesson is somewhat counter-intuitive and is received with some skepticism by experienced medical researchers. Nonetheless, it is a core principle that diseases are caused by perturbed pathways, and that the successful treatment of diseases has always involved compensating, in one way or another, for pathway disturbances. Let us review some examples that demonstrate the point.

Imatinib (trade name Gleevec) inhibits tyrosine kinase, an enzyme involved in a pathway that drives the growth of various rare tumors and proliferative diseases (e.g., chronic myelogenous leukemia, gastrointestinal stromal tumor, hypereosinophilic syndrome) (5), (6), (7), (8), (9). Pathways with increased tyrosine kinase activity, and pathways whose tyrosine kinase activity is particularly sensitive to the inhibiting action of imatinib, would make the best drug targets. Because imatinib is targeted to a key protein in a general pathway that contributes to a proliferative phenotype, it has potential benefit in diseases caused by mutations in genes other than tyrosine kinase.

Bevacizumab (trade name Avastin) is an angiogenesis (i.e., vessel-forming) inhibitor. All cancers require vessel growth. In theory, bevacizumab is a universal tumor growth inhibitor because its target is the non-neoplastic mesenchymal cells that form the vessels that feed growing tumor cells. Bevacizumab is employed in the treatment of common cancers, including cancers of the colon, lung, breast, kidney, ovaries, and brain (i.e., glioblastoma). Bevacizumab produces tumor shrinkage in more than half of vestibular schwannomas occurring in neurofibromatosis type 2 (10). As you might expect, bevacizumab has its greatest value in diseases for which neovascularization has a required role in pathogenesis. Two non-cancerous diseases of vascularization, treated with angiogenesis inhibitors, are hereditary hemorrhagic telangiectasia (11), and various forms of ocular neovascularization, including common age-related macular degeneration (12).

Because pathways are interconnected, a drug that is effective against a component of a pleiotropic pathway may be effective against multiple diseases. For example, Janus kinase genes (e.g., JAK1, JAK2, JAK3, TYK2) influence the growth and immune responsiveness of various blood cells, through their activity on cytokines. Mutations of the JAK2 gene are involved in several myeloproliferative conditions, including myelofibrosis, polycythemia vera, and at least one form of hereditary thrombocythemia (13), (14), (15).

Inhibitors of JAK genes have been approved for the treatment of various diseases that involve heightened proliferation of lymphocytes, in immune reactions, or blood cells, in myeloproliferative disorders. Ruxolitinib has been approved in the U.S. for use in myelofibrosis, and has been studied in psoriasis and rheumatoid arthritis (16). A host of JAK pathway inhibitors are either approved or under clinical trials for the treatment of allergic diseases, rheumatoid arthritis, psoriasis, myelofibrosis, myeloproliferative disorders, acute myeloid leukemia, and relapsed lymphoma (17). Again, specialized knowledge of rare diseases has led to generalized methods of treating a variety of related diseases, some of which are quite common.

Rule - Common diseases and rare diseases that share a pathway are likely to respond to the same pathway-targeted drug.

Brief Rationale - Pathogenesis (i.e., the biological steps that lead to disease) and clinical phenotype (i.e., the biological features that characterize a disease) are determined by cellular pathways. If a pathway has a crucial role in the development of disease, then you would have reason to hope that drugs that disrupt the pathway will alter the progression and the expression of the disease, whether the disease is common or rare.

Individuals with a rare resistance to HIV infection have a specific deletion in the gene that codes for the CCR5 co-receptor. The gene plays a role in the entry of HIV into cells; no entry, no infection. As it happens, both HIV and smallpox virus enhance their infectivity by exploiting the same receptor, CCR5, on the surface of white blood cells. This shared mode of infection may contribute to the cross-protection against HIV that seems to come from smallpox vaccine. It has been suggested that the emergence of HIV in the 1980s may have resulted, in part, from the cessation of smallpox vaccinations in the late 1970s (18). The same rare CCR5 gene deletion that protects against HIV infection may very well protect against smallpox infection. We may never know with certainty whether this is true because smallpox has been eradicated, along with smallpox experiments. Nonetheless, knowledge of the role of CCR5 in HIV infection has inspired the development of a new class of HIV drugs targeted against entry receptors (19).

Individuals with genetic absence of the Duffy antigen receptor for chemokines (i.e., DARC, formerly known as Duffy blood group antigen) are protected from malaria caused by Plasmodium vivax. It turns out that entry of the parasite requires participation by DARC (20), (21). A new vaccine candidate for P. vivax malaria, targeted against the Duffy binding protein, was developed based on observations of naturally occurring resistance in individuals lacking DARC (22), (20).

Osteoporosis-pseudoglioma syndrome is a rare disease characterized clinically by multiple bone fractures and various eye and neurologic abnormalities. It is caused by loss-of-function mutations in the low-density lipoprotein receptor-related protein-5 (LRP5). LRP5, under normal conditions, reduces the production of serotonin in the gut. Based on rare disease research directed towards understanding the role of LRP5, agents that compensate for the reduction in LRP5 by reducing gut serotonin are candidate drugs for the treatment of both rare osteoporosis-pseudoglioma syndrome and common osteoporosis (23), (24), (25).

Of course, advances in the common diseases may have value in treating rare diseases. Losartan is an effective drug against one of the most common diseases of humans: hypertension. Losartan blocks the angiotensin II type 1 receptor, and it also dampens TGF-beta (Transforming Growth Factor-beta) signaling. In Marfan syndrome, a rare disease of connective tissue, growth of the aortic root may lead to life-threatening aortic aneurysm. A reduction in TGF-beta activity, following losartan treatment, reduces growth of the aortic root, and slows the progression of aortic root distension in Marfan syndrome (26).

Shared cures for the rare diseases and the common diseases do not occur as low-probability events in an unpredictable world. Knowledge of disease biology leads us to conclude that whenever a cure for a rare disease is found, there is a high likelihood that this same cure will have practical application in the treatment of a common disease. Pharmaceutical companies understand that rare disease research is often the prelude to common disease research (27).

It is crucially important to appreciate the role of rare diseases in drug development. If funding agencies do not appreciate how cures for the rare diseases will lead to cures for the common diseases, the field of rare disease research will continue to be under-funded and generally ignored by the medical research community.


[1] Holland Frei Cancer Medicine. Kufe D, Pollock R, Weichselbaum R, Bast R, Gansler T, Holland J, Frei E, eds. BC Decker, Ontario, Canada, 2003.

[2] Yamamoto T, Yano M, Xu X, Uchinoumi H, Tateishi H, Mochizuki M, et al. Identification of target domains of the cardiac ryanodine receptor to correct channel disorder in failing hearts. Circulation 117:762-772, 2008.

[3] Leung E, Landa G. Update on current and future novel therapies for dry age-related macular degeneration. Expert Rev Clin Pharmacol Aug 24, 2013.

[4] Legendre C, Sberro-Soussan R, Zuber J, Rabant M, Loupy A, Timsit MO, et al. Eculizumab in renal transplantation. Transplant Rev (Orlando) 27:90-92, 2013.

[5] Berman J, O'Leary TJ. Gastrointestinal stromal tumor workshop. Hum Pathol. 2001 Jun;32(6):578-82.

[6] Heinrich MC, Joensuu H, Demetri GD, Corless CL, Apperley J, Fletcher JA, et al. Phase II, Open-Label study evaluating the activity of Imatinib in treating life-threatening malignancies known to be associated with Imatinib-Sensitive Tyrosine Kinases. Clin Cancer Res 14:2717-2725, 2008.

[7] Heinrich MC, Corless CL, Demetri GD, Blanke CD, von Mehren M, Joensuu H, et al. Kinase mutations and imatinib response in patients with metastatic gastrointestinal stromal tumor. J Clin Oncol 21:4342-4349, 2003.

[8] Selvi N, Kaymaz BT, Sahin HH, Pehlivan M, Aktan C, Dalmizrak A, et al. Two cases with hypereosinophilic syndrome shown with real-time PCR and responding well to imatinib treatment. Mol Biol Rep 40:1591-1597, 2013.

[9] Cools J, DeAngelo DJ, Gotlib J, Stover EH, Legare RD, Cortes J, et al. A tyrosine kinase created by fusion of the PDGFRA and FIP1L1 genes as a therapeutic target of imatinib in idiopathic hypereosinophilic syndrome. New Eng J Med 348:1201-1214, 2003.

[10] Plotkin SR, Merker VL, Halpin C, Jennings D, McKenna MJ, Harris GJ, et al. Bevacizumab for progressive vestibular schwannoma in neurofibromatosis type 2: a retrospective review of 31 patients. Otol Neurotol 33:1046-1052, 2012.

[11] Bose P, Holter JL, Selby GB. Bevacizumab in hereditary hemorrhagic telangiectasia. N Engl J Med 360:2143-2144, 2009.

[12] Eyetech Study Group. Anti-vascular endothelial growth factor therapy for subfoveal choroidal neovascularization secondary to age-related macular degeneration: phase II study results. Ophthalmology 110:979-86, 2003.

[13] Mead AJ, Rugless MJ, Jacobsen SEW, Schuh A. Germline JAK2 mutation in a family with hereditary thrombocytosis. New Eng J Med 366:967-969, 2012.

[14] Barosi G, Bergamaschi G, Marchetti M, Vannucchi AM, Guglielmelli P, Antonioli E, et al. JAK2 V617F mutational status predicts progression to large splenomegaly and leukemic transformation in primary myelofibrosis. Blood 110:4030-4036, 2007.

[15] Zhang L, Lin X. Some considerations of classification for high dimension low-sample size data. Stat Methods Med Res. 2011 Nov 23. Available from:, viewed January 26, 2013.

[16] Mesa RA, Yasothan U, Kirkpatrick P. Ruxolitinib. Nat Rev Drug Discov 11:103-104, 2012.

[17] Pesu M, Laurence A, Kishore N, Zwillich SH, Chan G, O'Shea JJ. Therapeutic targeting of Janus kinases. Immunol Rev 223:132-42, 2008.

[18] Smallpox demise linked to spread of HIV infection. BBC News May 17, 2010.

[19] Huang Y, Paxton WA, Wolinsky SM, Neumann AU, Zhang L, He T, et al. The role of a mutant CCR5 allele in HIV-1 transmission and disease progression. Nat Med 2:1240-1243, 1996.

[20] Arevalo-Herrera M, Castellanos A, Yazdani SS, Shakri AR, Chitnis CE, Dominik R, et al. Immunogenicity and protective efficacy of recombinant vaccine based on the receptor-binding domain of the Plasmodium vivax Duffy binding protein in Aotus monkeys. Am J Trop Med Hyg 73:25-31, 2005.

[21] Miller LH, Mason SJ, Clyde DF, McGinniss MH. The resistance factor to Plasmodium vivax in blacks. The Duffy-blood-group genotype, FyFy. N Engl J Med 295:302-304, 1976.

[22] Hill AVS. Evolution, revolution and heresy in the genetics of infectious disease susceptibility. Philos Trans R Soc Lond B Biol Sci 367:840-849, 2012.

[23] Long F. When the gut talks to bone. Cell 135:795-796, 2008.

[24] Field MJ, Boat T. Rare Diseases and Orphan Products: Accelerating Research and Development. Institute of Medicine (US) Committee on Accelerating Rare Diseases Research and Orphan Product Development. 2010. The National Academies Press, Washington, D.C. Available from:

[25] Zhang W, Drake MT. Potential role for therapies targeting DKK1, LRP5, and serotonin in the treatment of osteoporosis. Curr Osteoporos Rep 10:93-100, 2012.

[26] Chiu HH, Wu MH, Wang JK, Lu CW, Chiu SN, Chen CA, et al. Losartan added to beta-blockade therapy for aortic root dilation in Marfan syndrome: a randomized, open-label pilot study. Mayo Clin Proc 88:271-276, 2013.

[27] Ayme S, Hivert V (eds.), "Report on rare disease research, its determinants in Europe and the way forward", INSERM, May 2011. Available from:, viewed February 26, 2013.

- Jules Berman (copyrighted material)

key words: rare diseases, orphan drugs, precision medicine, medical research, research funding, justification for medical research, jules j berman

Wednesday, January 27, 2016


It is difficult to rank the common infectious diseases in humans. Some organisms have a very high rate of infection, but produce a relatively low rate of clinical disease and death. Other organisms have relatively low levels of infection, but have a very high virulence, resulting in many deaths.

What follows is a listing of the most common infections that occur in humans, beginning with the organisms that can be found within the majority of humans (e.g., greater than 3.5 billion), and ending with infections involving more than one million individuals. Some infections that deserve to be included here (e.g., yellow fever) are omitted because I could not find a trusted historical data source.

Infections occurring in the majority of humans (i.e., 3.5 to 7 billion cases).

- Demodex is a tiny mite that lives in facial skin. Demodex mites can be found in the majority of humans.

- The BK polyomavirus rarely causes disease in infected patients, and the majority of humans carry the latent virus.

- The JC polyomavirus persistently infects the majority of humans, but it is not associated with disease in otherwise healthy individuals.

Infections occurring in 1 to 3.5 billion humans.

- About two billion people (of the world's 7 billion population) have been infected with Mycobacterium tuberculosis.

- About one third of the human population (i.e., about 2.3 billion people) has been infected by Toxoplasma gondii, the only species that produces human toxoplasmosis.

- Ascaris lumbricoides, the cause of ascariasis, infects about 1.5 billion people worldwide, making it the most common helminth (worm) infection of humans (1).

Infections involving 500 million to 1 billion humans.

- Various estimates suggest that, worldwide, more than half a billion people are infected with one or another subtype of Chlamydia trachomatis. This would include the various Chlamydia organisms and serotypes that account for trachoma and chlamydial urethritis. According to the World Health Organization, there are about 37 million blind persons worldwide. Trachoma, caused by Chlamydia trachomatis, is the number one infectious cause of blindness and accounts for about 4% of these cases. The second most common infectious cause of blindness worldwide is Onchocerca volvulus, accounting for about 1% of cases (2).

- About 200 million people are infected by schistosomes (i.e., have some form of schistosomiasis).

- Hookworms infect about 600 million people worldwide. Two species are responsible for nearly all cases of hookworm disease in humans: Ancylostoma duodenale and Necator americanus.

- Scabies is an exceedingly common, global disease, with about 300 million new cases occurring annually.

Infections involving 100 million to 500 million humans.

- Hepatitis B infects more than 200 million people, worldwide, causing two million deaths each year.

- Bubonic plague is credited with killing one third of the population of Europe in the mid-1300s. Altogether, bubonic plague is estimated to have caused about 200 million deaths. In modern times, plague is rare, but not extinct. Each year, several thousand cases of plague occur worldwide, resulting in several hundred deaths. Virtually all of the contemporary cases occur in Africa.

- Genus Plasmodium is responsible for human and animal malaria. About 300 to 500 million people are infected with malaria worldwide, causing 2 million deaths each year (3), (4).

- About 150 million people are infected by the filarial nematodes (genera Brugia, Loa, Onchocerca, Mansonella, and Wuchereria) (5). Wuchereria bancrofti and Brugia malayi, together, infect about 120 million individuals (5). Most cases occur in Africa and Asia.

- Smallpox is reputed to have killed about 300 million people in the twentieth century, prior to the widespread availability of an effective vaccine. Smallpox, now extinct, has been referred to as the greatest killer in human history.

- Worldwide, about 100 million cases of acute diarrhea are caused by rotavirus. In 2004, rotavirus infections accounted for about a half million deaths in young children, from severe diarrhea (6).

Infections involving 10 million to 100 million humans.

- The 1918-1919 influenza pandemic caused somewhere between 50 million and 100 million deaths. Seasonal influenza kills between a quarter million and a half million people worldwide, each year. In the U.S., seasonal influenza accounts for about 40,000 deaths annually.

- It is estimated that about 50 million people are infected by Entamoeba histolytica, with about 70,000 deaths per year, worldwide.

- Paragonimus westermani, along with dozens of less frequent species within Genus Paragonimus, causes the condition known as paragonimiasis. About 22 million people are infected worldwide, with most cases occurring in Southeast Asia, Africa, and South America.

- More than 50 million dengue virus infections occur each year, causing about 25,000 deaths worldwide. Most infections are asymptomatic or cause only mild disease. A minority of cases are severe.

- Between 1918 and 1922, epidemic, louse-borne, typhus (Rickettsia prowazekii) infected 30 million people, in Eastern Europe and Russia, accounting for about three million deaths (7).

- The most common sexually transmitted disease is trichomoniasis (Class Metamonada), with about 8 million new cases each year in North America (8). The second most common sexually transmitted disease is chlamydia (Class Chlamydiae), with about 4 million new cases each year in North America (8). Approximately 1.5 million new cases of gonorrhea occur annually in North America, where gonorrhea is the third most common sexually transmitted disease (8). According to the U.S. Centers for Disease Control and Prevention, there were about 46,000 new cases of syphilis and 48,000 new cases of HIV reported in the U.S. in 2011 (9).

- Leprosy, also known as Hansen disease, is caused by Mycobacterium leprae and Mycobacterium lepromatosis. From the 1960s to the 1980s, the number of leprosy cases worldwide was a steady 10-12 million (10). The introduction of effective multidrug protocols has resulted in many cured cases and in lowered infection rates. Consequently, the number of cases of leprosy dropped to about 5.5 million worldwide in the 1990s (10). In 2005, there were about 300,000 new cases reported worldwide (11).

- Leishmania species cause leishmaniasis, a disease that infects about 12 million people worldwide. Each year, about 60,000 people die from the visceral form of the disease.

- Trypanosoma cruzi is the cause of Chagas disease, also known as American trypanosomiasis. Chagas disease affects about eight million people (12). Trypanosoma brucei is the cause of African trypanosomiasis (sleeping sickness). The reported numbers of cases may be somewhat unreliable, but it has been estimated that infection with Trypanosoma brucei accounts for about 50,000 deaths each year.

- Fasciolopsiasis is caused by Fasciolopsis buski, a large (up to 7.5 cm in length) fluke that lives in the intestines of its primary hosts, pigs and humans. The number of humans infected is about 10 million.

- Clonorchis sinensis, the Chinese liver fluke (also known as the Oriental liver fluke) infects about 30 million people.

- In the year 2000, measles caused nearly 40 million illnesses and about 750,000 deaths worldwide (13).
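
For readers who keep such counts in a script rather than in their heads, the tiers used in the listing above can be sketched as a small lookup. This is only an illustrative sketch in Python: the names TIERS and tier_for are hypothetical, the thresholds are copied from the tier headings, the choice to treat each lower bound as inclusive is an assumption, and the sample counts are the approximate figures quoted in the list.

```python
# Hypothetical helper: assign an approximate case count to one of the
# tiers used in the listing. Thresholds come from the tier headings;
# treating each lower bound as inclusive is an assumption.
TIERS = [
    (3_500_000_000, "majority of humans (3.5 to 7 billion)"),
    (1_000_000_000, "1 to 3.5 billion"),
    (500_000_000, "500 million to 1 billion"),
    (100_000_000, "100 million to 500 million"),
    (10_000_000, "10 million to 100 million"),
]


def tier_for(cases):
    """Return the label of the first tier whose lower bound is met."""
    for lower_bound, label in TIERS:
        if cases >= lower_bound:
            return label
    return "fewer than 10 million"


# Approximate counts quoted in the list (not new data).
EXAMPLES = {
    "Mycobacterium tuberculosis": 2_000_000_000,
    "Hookworms": 600_000_000,
    "Hepatitis B": 200_000_000,
    "Entamoeba histolytica": 50_000_000,
}

for organism, cases in EXAMPLES.items():
    print(f"{organism}: {tier_for(cases)}")
```

Because the underlying estimates are rough, and some entries count deaths or annual new cases rather than current infections, a lookup of this kind is a bookkeeping aid and not a ranking of disease burden.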

What kinds of organisms cause infections? All types. It seems to be a condition of terrestrial life that organisms live within one another. Humans can be infected by members of any of the classical kingdoms of living organisms: Bacteria, Fungi, Animalia, Plantae, and the kingdom formerly known as Protoctista, which contains the protozoan parasites. Humans can also be infected by non-living agents (i.e., viruses, prions). In my book "Taxonomic guide to infectious diseases: understanding the biologic classes of pathogenic organisms," most of the known infections of humans are described, with their culpable organisms assigned to their proper phylogenetic classes (14).


[1] Crompton DW. How much human helminthiasis is there in the world? J Parasitol 85:397-403, 1999.

[2] Resnikoff S, Pascolini D, Etyaale D, Kocur I, Pararajasegaram R, Pokharel GP, et al. Global data on visual impairment in the year 2002. Bull World Health Organ 82:844-851, 2004.

[3] The state of world health. Chapter 1 in World Health Report 1996. World Health Organization. Available from: 1996.

[4] Lemon SM, Sparling PF, Hamburg MA, Relman DA, Choffnes ER, Mack A. Vector-Borne Diseases: Understanding the Environmental, Human Health, and Ecological Connections, Workshop Summary. Institute of Medicine (US) Forum on Microbial Threats. Washington (DC): National Academies Press (US). 2008.

[5] Foster J, Ganatra M, Kamal I, Ware J, Makarova K, Ivanova N, et al. The Wolbachia genome of Brugia malayi: endosymbiont evolution within a human pathogenic nematode. PLoS Biol 3:e121, 2005.

[6] Weekly epidemiological record. World Health Organization. 32:285-296, 2007.

[7] Cowan G. Rickettsial diseases: the typhus group of fevers: a review. Postgrad Med J 76:269-272, 2000.

[8] Global prevalence and incidence of selected curable sexually transmitted infections: overview and estimates. World Health Organization. Geneva. 2001.

[9] Sexually transmitted diseases. U.S. Centers for Disease Control and Prevention. Available from, viewed October 24, 2013.

[10] Noordeen SK, Lopez Bravo L, Sundaresan TK. Estimated number of leprosy cases in the world. Bull World Health Organ 70:7-10, 1992.

[11] World Health Organization. Global leprosy situation. Weekly Epidemiological Record 81:309-16, 2006.

[12] Rassi A Jr, Rassi A, Marin-Neto JA. Chagas disease. Lancet 375:1388-1402, 2010.

[13] Stein CE, Birmingham M, Kurian M, Duclos P, Strebel P. The global burden of measles in the year 2000: a model that uses country-specific indicators. J Infect Dis 187:S8, 2003.

[14] Berman JJ. Taxonomic Guide to Infectious Diseases: Understanding the Biologic Classes of Pathogenic Organisms. Academic Press, Waltham, 2012.

- Jules Berman (copyrighted material)

key words: infectious diseases, incidence of disease, infection, jules j berman