Sunday, March 6, 2016

Data Simplification: Identifiers


Over the next few weeks, I will be writing on topics related to my latest book, Data Simplification: Taming Information With Open Source Tools (release date March 23, 2016). I hope I can convince you that this is a book worth reading.


Blog readers can use the discount code: COMP315 for a 30% discount, at checkout.


"I always wanted to be somebody, but now I realize I should have been more specific." -Lily Tomlin

An object identifier is anything associated with the object that persists throughout the life of the object and that is unique to the object (i.e., does not belong to any other object). Everyone is familiar with biometric identifiers, such as fingerprints, iris patterns, and genome sequences. In the case of data objects, the identifier usually refers to a randomly chosen long sequence of numbers and letters that is permanently assigned to the object and which is never assigned to any other data object.

An identifier system is a set of data-related protocols that satisfy the following conditions: 1) Completeness (i.e., every unique object has an identifier); 2) Uniqueness (i.e., each identifier is a unique sequence); 3) Exclusivity (i.e., each identifier is assigned to only one unique object, and to no other object, ever); 4) Authenticity (i.e., objects that receive identification can be verified as the objects that they are intended to be); 5) Aggregation (i.e., all information associated with an identifier can be collected); and 6) Permanence (i.e., an identifier is never deleted).
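To make the six conditions a little more concrete, here is a minimal Python sketch of an in-memory registry. Everything in it (the class name IdentifierRegistry, its method names, and the use of the standard "secrets" module to mint identifiers) is an assumption made for illustration; it is not a description of any particular identifier system, and its notion of authenticity is deliberately simplistic.

```python
import secrets


class IdentifierRegistry:
    """Hypothetical in-memory registry; all names here are illustrative."""

    def __init__(self):
        self._objects = {}  # identifier -> the object it was assigned to
        self._data = {}     # identifier -> list of (metadata, value) pairs

    def register(self, obj):
        # Completeness: every object passed through register() gets an identifier.
        identifier = secrets.token_hex(16)
        # Uniqueness and exclusivity: never hand out an identifier twice.
        while identifier in self._objects:
            identifier = secrets.token_hex(16)
        self._objects[identifier] = obj
        self._data[identifier] = []
        return identifier

    def verify(self, identifier, obj):
        # Authenticity (greatly simplified): is this the very object
        # that originally received the identifier?
        return self._objects.get(identifier) is obj

    def annotate(self, identifier, metadata, value):
        # Attach one piece of described data to an identified object.
        self._data[identifier].append((metadata, value))

    def collect(self, identifier):
        # Aggregation: return everything recorded for this identifier.
        return list(self._data[identifier])

    # Permanence: the class intentionally offers no way to delete an identifier.


registry = IdentifierRegistry()
left_shoe = object()
shoe_id = registry.register(left_shoe)
registry.annotate(shoe_id, "weight_in_pounds", 1)
print(shoe_id, registry.collect(shoe_id))
```

A production system would also need persistent storage and a stronger authenticity check; the sketch only shows where each condition would live in code.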

Uniqueness is a very strange concept, especially when applied to the realm of data. For example, if I refer to the number 1, then I am referring to a unique number among other numbers (i.e., there is only one number 1). Yet the number 1 may apply to many different things (i.e., 1 left shoe, 1 umbrella, 1 prime number between 2 and 5). The number 1 makes very little sense to us until we know something about what it measures (e.g., left shoe) and the object to which the measurement applies (e.g., shoe_id_#840354751) [1].

We refer to uniquely assigned, computer-generated character strings as "identifiers". As such, computer-generated identifiers are abstract constructs that do not need to embody any of the natural properties of the object. A long (e.g., 200-character) string consisting of randomly chosen numeric and alphabetic characters is an excellent identifier, because the chance of two objects being assigned the same string is essentially zero. When we need to establish the uniqueness of some object, such as a shoe or a data record, we bind the object to a contrived identifier.
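As a rough illustration, a random alphanumeric identifier of this kind can be produced in a few lines of Python. The function name random_identifier and the choice of the standard "secrets" module are my assumptions for this sketch; any good source of randomness would serve.

```python
import secrets
import string

# 26 lowercase + 26 uppercase letters + 10 digits = 62 possible characters
ALPHABET = string.ascii_letters + string.digits


def random_identifier(length=200):
    """Return a string of randomly chosen letters and digits (illustrative helper)."""
    return "".join(secrets.choice(ALPHABET) for _ in range(length))


shoe_id = random_identifier()
print(len(shoe_id))
print(shoe_id)
```

With 62 possible characters in each of 200 positions, there are 62^200 (roughly 10^358) possible strings, which is why an accidental duplicate is not a practical concern.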

Jumping ahead just a bit, if we say "part number 32027563 weighs 1 pound," then we are dealing with a meaningful assertion. The assertion tells us three things: 1) that there is a unique thing, known as part number 32027563, 2) the unique thing has a weight, and 3) the weight has a measurement of 1 pound. The phrase "weighs 1 pound" has no meaning until it is associated with a unique object (i.e., part number 32027563 weighs 1 pound). The assertion that "part number 32027563 weighs 1 pound" is a "triple," the embodiment of meaning in the field of computational semantics. A triple consists of a unique, identified object, matched to a pair of data and metadata (i.e., a data element and the description of the data element). Information experts use formal syntax to express triples as data structures.
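Setting formal syntaxes aside, the shape of a triple is easy to sketch in Python. The Triple type and its field names below are hypothetical, chosen only to mirror the wording of the paragraph above.

```python
from collections import namedtuple

# identifier: the unique object; metadata: what is being described; data: the value
Triple = namedtuple("Triple", ["identifier", "metadata", "data"])

# "part number 32027563 weighs 1 pound"
assertion = Triple(identifier="32027563", metadata="weight_in_pounds", data=1)

# Without the identifier, only a metadata/data pair remains:
# a weight of 1 pound that belongs to nothing in particular.
orphan_pair = (assertion.metadata, assertion.data)

print(assertion)
print(orphan_pair)
```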

Returning to the issue of object identification, there are various methods for generating and assigning unique identifiers to data objects [2], [3], [4], [5]. Some identification systems assign a group prefix to an identifier sequence that is unique for the members of the group. For example, a prefix for a research institute may be attached to every data object generated within the institute. If the prefix is registered in a public repository, data from the institute can be merged with data from other institutes, and the institutional source of the data object can always be determined. The value of prefixes, and other reserved namespace designations, can be undermined when implemented thoughtlessly [1].
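The prefix idea can be sketched as follows. The prefixes ("institute-a", "institute-b"), the colon separator, and the helper names are all invented for this example; real registries define their own prefix formats.

```python
import uuid

SEPARATOR = ":"  # assumed separator; it never occurs inside a UUID string


def make_prefixed_id(prefix):
    """Attach a (hypothetical) registered group prefix to a fresh identifier."""
    return f"{prefix}{SEPARATOR}{uuid.uuid4()}"


def source_of(identifier):
    """Recover the group prefix, and hence the institutional source."""
    prefix, _, _ = identifier.partition(SEPARATOR)
    return prefix


# Data objects from two institutes, merged into one collection
merged = [make_prefixed_id("institute-a") for _ in range(2)]
merged += [make_prefixed_id("institute-b") for _ in range(2)]

for identifier in merged:
    print(source_of(identifier), identifier)
```

Because the prefix travels with the identifier, the source of each object survives the merge.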

Identifiers are data simplifiers, when implemented properly, because they allow us to collect all of the data associated with a unique object, while ensuring that we exclude data that should be associated with some other object.
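Here is a small sketch of that collecting step, assuming each data record is stored as an (identifier, metadata, value) tuple. The records themselves are made up for the example (the "color" entry in particular is invented), and the function name collect_by_identifier is hypothetical.

```python
from collections import defaultdict

# Made-up records; each one names the object it belongs to.
records = [
    ("32027563", "weight_in_pounds", 1),
    ("840354751", "type", "left shoe"),
    ("32027563", "color", "gray"),
]


def collect_by_identifier(records):
    """Group every (metadata, value) pair under the identifier it belongs to."""
    collected = defaultdict(dict)
    for identifier, metadata, value in records:
        collected[identifier][metadata] = value
    return dict(collected)


by_object = collect_by_identifier(records)
print(by_object["32027563"])
print(by_object["840354751"])
```

Everything known about part 32027563 ends up in one place, and the shoe's data never leaks into it.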

UUID (Universally Unique IDentifier) is an example of one type of algorithm that creates identifiers that are, for practical purposes, collision-free and that can be generated on command, at the moment when new objects are created (i.e., during the run-time of a software application). Linux systems have a built-in UUID utility, "uuidgen", that can be called from the system prompt (in Unix-like environments for Windows, such as Cygwin, the same utility appears as "uuidgen.exe").

Here are a few examples of output values generated by the "uuidgen.exe" utility:
$ uuidgen.exe
312e60c9-3d00-4e3f-a013-0d6cb1c9a9fe

$ uuidgen.exe
822df73c-8e54-45b5-9632-e2676d178664

$ uuidgen.exe
8f8633e1-8161-4364-9e98-fdf37205df2f

$ uuidgen.exe
83951b71-1e5e-4c56-bd28-c0c45f52cb8a

$ uuidgen -t
e6325fb6-5c65-11e5-b0e1-0ceee6e0b993

$ uuidgen -r
5d74e36a-4ccb-42f7-9223-84eed03291f9

Data Simplification: Taming Information With Open Source Tools describes simple implementations of UUID utilities under Windows.

Notice that each of the final two examples has a parameter added to the "uuidgen" command (i.e., "-t" and "-r"). Several versions of the UUID algorithm are available. The "-t" parameter instructs the utility to produce a UUID based on the time (measured in 100-nanosecond intervals elapsed since the start of October 15, 1582, the date on which the Gregorian calendar was introduced). The "-r" parameter instructs the utility to produce a UUID based on the generation of a random or pseudorandom number. In either case, the UUID utility produces a fixed-length character string suitable as an object identifier. The UUID utility is trusted and widely used by computer scientists.
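For readers who would rather stay inside a script, Python's standard "uuid" module offers rough equivalents: uuid.uuid1() for time-based identifiers and uuid.uuid4() for random ones. The sketch below also inspects the two sample values shown earlier. The helper uuid1_timestamp is my own illustrative function, not part of the module; it relies on the module's "time" attribute, which holds the count of 100-nanosecond intervals since October 15, 1582.

```python
import uuid
from datetime import datetime, timedelta, timezone

# Starting point of the UUID version 1 clock
GREGORIAN_EPOCH = datetime(1582, 10, 15, tzinfo=timezone.utc)


def uuid1_timestamp(u):
    """Return the creation time embedded in a time-based UUID (illustrative helper)."""
    if u.version != 1:
        raise ValueError("not a time-based (version 1) UUID")
    # u.time counts 100-nanosecond intervals; ten of them make one microsecond.
    return GREGORIAN_EPOCH + timedelta(microseconds=u.time // 10)


# The two sample values produced above by "uuidgen -t" and "uuidgen -r"
time_based = uuid.UUID("e6325fb6-5c65-11e5-b0e1-0ceee6e0b993")
random_based = uuid.UUID("5d74e36a-4ccb-42f7-9223-84eed03291f9")

print(time_based.version, random_based.version)
print(uuid1_timestamp(time_based))

# Fresh identifiers, generated at run-time
print(uuid.uuid1())
print(uuid.uuid4())
```

The version number sits in the first character of the third group of the string ("11e5" versus "42f7"), which is how the two kinds of UUID can be told apart at a glance.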

References:

[1] Berman JJ. Repurposing Legacy Data: Innovative Case Studies. Morgan Kaufmann, Waltham, MA, 2015.

[2] Leach P, Mealling M, Salz R. A Universally Unique IDentifier (UUID) URN Namespace. Network Working Group, Request for Comment 4122, Standards Track. Available from: http://www.ietf.org/rfc/rfc4122.txt, viewed Jan. 1, 2015.

[3] Mealling M. RFC 3061. A URN Namespace of Object Identifiers. Network Working Group, 2001. Available from: https://www.ietf.org/rfc/rfc3061.txt, viewed Jan. 1, 2015.

[4] Berman JJ. Principles of Big Data: Preparing, Sharing, and Analyzing Complex Information. Morgan Kaufmann, Waltham, MA, 2013.

[5] Berman JJ. Methods in Medical Informatics: Fundamentals of Healthcare Programming in Perl, Python, and Ruby. Chapman and Hall, Boca Raton, FL, 2010.


- Jules Berman

key words: computer science, data analysis, data repurposing, data simplification, data wrangling, information science, simplifying data, taming data, complexity, identifiers, jules j berman