Monday, March 7, 2016

DATA SIMPLIFICATION: Poor Identifiers, Horrific Consequences

Over the next few weeks, I will be writing on topics related to my latest book, Data Simplification: Taming Information With Open Source Tools (release date March 23, 2016). I hope I can convince you that this is a book worth reading.


Blog readers can use the discount code COMP315 at checkout for a 30% discount.



All information systems, all databases, and all good collections of data are best envisioned as identifier systems to which data (belonging to the identifier) can be added over time.

If the system is corrupted (e.g., multiple identifiers for the same object, or data belonging to one object incorrectly attached to other objects), then the system has no value. You can't trust any of the individual records, and you can't trust any of the analyses performed on collections of records. Furthermore, if the data from a corrupted system is merged with the data from other systems, then all analyses performed on the aggregated data become unreliable and useless. This holds true even when every other contributor to the system shares reliable data.

Without proper identifiers, the following may occur: data values can be assigned to the wrong data objects; data objects can be replicated under different identifiers, with each replicant having an incomplete data record (i.e., an incomplete set of data values); the total number of data objects cannot be determined; data sets cannot be verified; and the results of data set analyses will not be valid.
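The failure modes listed above can be made concrete with a toy identifier system. The Python sketch below is only an illustration under stated assumptions: the class name IdentifierSystem, its methods, and the sample data are all hypothetical, and a real system would need far more than this.

```python
import uuid


class IdentifierSystem:
    """Toy identifier system: data accumulates under one identifier per object."""

    def __init__(self):
        self.records = {}  # identifier -> list of (field, value) pairs

    def register(self):
        """Create a new, unique identifier with an empty record."""
        identifier = str(uuid.uuid4())
        self.records[identifier] = []
        return identifier

    def add(self, identifier, field, value):
        """Attach a data value to an existing identifier."""
        if identifier not in self.records:
            raise KeyError("unknown identifier: " + identifier)
        self.records[identifier].append((field, value))


# Sound system: one person, one identifier, one complete record.
sound = IdentifierSystem()
person = sound.register()
sound.add(person, "blood_type", "O+")
sound.add(person, "allergy", "penicillin")

# Corrupted system: the same person is registered twice by mistake,
# so each replicate holds only part of the data.
corrupted = IdentifierSystem()
first_visit = corrupted.register()
second_visit = corrupted.register()
corrupted.add(first_visit, "blood_type", "O+")
corrupted.add(second_visit, "allergy", "penicillin")

print(len(sound.records), "object counted in the sound system")
print(len(corrupted.records), "objects counted in the corrupted system")
```

In the corrupted case, the count of data objects is wrong (two instead of one), and neither record contains the person's full set of data values.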

In the past, individuals were identified by name. When dealing with large numbers of names, it becomes obvious, almost immediately, that personal names are woefully inadequate. In a review of a population of 3.5 million, there were nearly 250,000 instances in which individuals shared the same first and last name, and 70,000 instances in which two people shared the same first name, last name, and birthdate [1]!
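Collisions of this kind are easy to count once records are in hand. The following Python sketch uses a small, invented list of people (not the population from the cited review) and a hypothetical helper, shared_keys, to find names, and name plus birthdate combinations, that belong to more than one record.

```python
from collections import Counter

# Hypothetical records, one per distinct person: (first name, last name, birthdate).
people = [
    ("John", "Smith", "1970-01-15"),
    ("John", "Smith", "1970-01-15"),
    ("John", "Smith", "1982-03-02"),
    ("Susan", "Garcia", "1965-11-30"),
    ("Susan", "Lo", "1991-07-04"),
]


def shared_keys(records, key):
    """Return the keys carried by more than one record, with their counts."""
    counts = Counter(key(record) for record in records)
    return {k: n for k, n in counts.items() if n > 1}


by_name = shared_keys(people, lambda r: (r[0], r[1]))
by_name_and_birthdate = shared_keys(people, lambda r: r)

print(by_name)
print(by_name_and_birthdate)
```

In this invented sample, three different people share one name, and two of them still cannot be told apart after the birthdate is added.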

Aside from the obvious fact that names are not unique (e.g., surnames such as Smith, Zhang, Garcia, Lo, and given names such as John and Susan), one name can have multiple representations. The sources for these variations are many. Here is a partial listing [2]:

1. Modifiers to the surname (du Bois, DuBois, Du Bois, Dubois, Laplace, La Place, van de Wilde, Van DeWilde, etc.).

2. Accents that may or may not be transcribed onto records (e.g., acute accent, cedilla, diacritical comma, palatalized mark, hyphen, diphthong, umlaut, circumflex, and a host of obscure markings).

3. Special typographic characters (the combined "ae").

4. Multiple "middle names" for an individual, that may not be transcribed onto records. Individuals who replace their first name with their middle name for common usage, while retaining the first name for legal documents.

5. Latinized and other versions of a single name (Carl Linnaeus, Carl von Linne, Carolus Linnaeus, Carolus a Linne).

6. Hyphenated names that are confused with first and middle names (e.g., Jean-Jacques Rousseau, or Jean Jacques Rousseau; Louis-Victor-Pierre-Raymond, 7th duc de Broglie, or Louis Victor Pierre Raymond Seventh duc deBroglie).

7. Cultural variations in name order that are mistakenly re-arranged when transcribed onto records. Many cultures do not adhere to the Western European name order (e.g., given name, middle name, surname).

8. Name changes, whether through marriage, legal action, aliasing, pseudonymous posing, or insouciant whim.
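A short Python sketch shows how far simple text normalization can go with the variations listed above, and where it stops. The function naive_key is hypothetical and deliberately crude: it strips accents, case, spaces, and hyphens, and nothing more.

```python
import unicodedata


def naive_key(name):
    """Crude comparison key: drop accents, case, spaces, and punctuation."""
    # Decompose accented characters, then discard the combining marks.
    decomposed = unicodedata.normalize("NFKD", name)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # Fold case and keep only letters and digits.
    return "".join(ch for ch in stripped.casefold() if ch.isalnum())


surname_variants = ["du Bois", "DuBois", "Du Bois", "Dubois"]
hyphen_variants = ["Jean-Jacques Rousseau", "Jean Jacques Rousseau"]
linnaeus_variants = ["Carl Linnaeus", "Carl von Linne",
                     "Carolus Linnaeus", "Carolus a Linne"]

print({naive_key(n) for n in surname_variants})
print({naive_key(n) for n in hyphen_variants})
print(len({naive_key(n) for n in linnaeus_variants}), "keys for one naturalist")
print(naive_key("Linn\u00e9") == naive_key("Linne"))
```

Spacing, capitalization, hyphens, and simple accents collapse to a single key, but the Latinized versions of one name remain four different strings. Name-order errors and name changes are likewise beyond the reach of this kind of cleanup.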

I have had numerous conversations with intelligent professionals who are tasked with the responsibility of assigning identifiers to individuals. At some point in every conversation, they will find it necessary to explain that although an individual's name cannot serve as an identifier, the combination of name plus date of birth provides accurate identification in almost every instance. They sometimes get carried away, insisting that the combination of name plus date of birth plus social security number provides perfect identification, as no two people will share all three identifiers: same name, same date of birth, same social security number. This is simply wrong. Let us see what happens when we create identifiers from the name plus birthdate.

Consider this example. Mary Jessica Meagher, born June 7, 1912, decided to open a separate bank account in each of 10 different banks. Some of the banks had application forms, which she filled out accurately. Other banks registered her account through a teller, who asked her a series of questions and transcribed her answers directly into a computer terminal. Ms. Meagher could not see the computer screen and could not review the entries for accuracy.

Here are the entries for her name plus date of birth [1]:

1. Marie Jessica Meagher, June 7, 1912 (the teller mistook Marie for Mary).

2. Mary J. Meagher, June 7, 1912 (the form requested a middle initial, not name).

3. Mary Jessica Magher, June 7, 1912 (the teller misspelled the surname).

4. Mary Jessica Meagher, Jan 7, 1912 (the birth month was constrained, on the form, to three letters; Jun, entered on the form, was transcribed as Jan).

5. Mary Jessica Meagher, 6/7/12 (the form provided spaces for the final two digits of the birth year. Through the miracle of bank registration, Mary, born in 1912, was re-born a century later).

6. Mary Jessica Meagher, 7/6/2012 (the form asked for day, month, year, in that order, as is common in Europe).

7. Mary Jessica Meagher, June 1, 1912 (on the form, a 7 was mistaken for a 1).

8. Mary Jessie Meagher, June 7, 1912 (Mary, as a child, was called by the informal form of her middle name, which she provided to the teller).

9. Mary Jesse Meagher, June 7, 1912 (Mary, as a child, was called by the informal form of her middle name, which she provided to the teller, and which the teller entered as the male variant of the name).

10. Marie Jesse Mahrer, 1/1/12 (an underzealous clerk combined all of the mistakes on the form and the computer transcript, and added a new orthographic variant of the surname).

In these ten examples, a single individual (Mary Jessica Meagher) would be assigned a different identifier at each of the 10 banks. Had Mary registered ten times at one bank, the results might have been the same.
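The effect is easy to reproduce. The Python sketch below takes a few of the entries listed above, exactly as recorded, and builds an identifier from name plus date of birth. The function name_dob_identifier and its format (lowercased name, a separator, then the birthdate string) are assumptions made for illustration; they do not describe how any actual bank builds its identifiers.

```python
# A few of the entries listed above, as the banks recorded them.
entries = [
    ("Marie Jessica Meagher", "June 7, 1912"),
    ("Mary J. Meagher", "June 7, 1912"),
    ("Mary Jessica Magher", "June 7, 1912"),
    ("Mary Jessica Meagher", "Jan 7, 1912"),
    ("Mary Jessica Meagher", "6/7/12"),
    ("Mary Jessica Meagher", "June 1, 1912"),
]


def name_dob_identifier(name, birthdate):
    """Hypothetical identifier: lowercased name joined to the birthdate string."""
    return name.casefold() + "|" + birthdate


# The identifier her accurate name and birthdate would have produced.
true_identifier = name_dob_identifier("Mary Jessica Meagher", "June 7, 1912")

identifiers = {name_dob_identifier(name, dob) for name, dob in entries}

print(len(entries), "entries ->", len(identifiers), "distinct identifiers")
print("any entry matches the true identifier:", true_identifier in identifiers)
```

Every entry yields its own identifier, and none of them matches the identifier that the correct name and birthdate would have produced.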

References:

[1] McCann E. The patient identifier debate: Will a national patient ID system ever materialize? Should it? Healthcare IT News, February 18, 2013.

[2] Berman JJ. Principles of Big Data: Preparing, Sharing, and Analyzing Complex Information. Morgan Kaufmann, Waltham, MA, 2013.


- Jules Berman

key words: computer science, data analysis, data repurposing, data simplification, data wrangling, information science, simplifying data, taming data, complexity, identifiers, jules j berman