Over the next few weeks, I will be writing on topics related to my latest book, Data Simplification: Taming Information With Open Source Tools (release date March 23, 2016). I hope I can convince you that this is a book worth reading.
Today's blog continues yesterday's discussion of Standards and Specifications.
Despite the problems inherent in standards, government committees cling to standards as the best way to share data. The perception is that in the absence of standards, the necessary activities of data sharing, data verification, data analysis, and any meaningful validation of conclusions would be impossible to achieve (1). This long-held perception may not be true. Data standards, intended to simplify our ability to understand and share data, may have increased the complexity of data science. As each new standard is born, our ability to understand our data seems to diminish. Luckily, many of the problems produced by the proliferation of data standards can be avoided by switching to a data annotation technique broadly known as "specification." Although the terms "specification" and "standard" are used interchangeably by the incognoscenti, the two terms are quite different from one another. A specification is a formal way of describing data. A standard is a set of requirements, created by a standards development organization, that prescribes a pre-determined content and format for a set of data.
A specification is an accepted method for describing objects (physical objects, such as nuts and bolts; or symbolic objects, such as numbers; or concepts expressed as text). In general, specifications do not require explicit items of information (i.e., they do not impose restrictions on the content that is included in or excluded from documents), and specifications do not impose any order of appearance on the data contained in the document (i.e., you can mix up and rearrange the data records in a specification if you like). Examples of specifications are RDF (Resource Description Framework), produced by the W3C (World Wide Web Consortium), and TCP/IP (Transmission Control Protocol/Internet Protocol), maintained by the Internet Engineering Task Force. The most widely implemented specifications are simple, and thus easily adopted.
Specifications provide a simple and uniform way of representing the information you choose to include in your reports, messages, and files. Some of the most useful and popular specifications are XML, RDF, Notation 3, and Turtle. Specifications are not typically certified by a standards organization; rather, they are developed by special interest groups, and their legitimacy depends on their popularity.
Files that comply with a specification can be parsed and manipulated by generalized software designed to parse the markup language of the specification (e.g., XML, RDF) and to organize the data into data structures defined within the file.
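To make this concrete, here is a minimal sketch of generalized parsing using Python's standard library. The XML document and its element names (record, name, value) are hypothetical; the point is that the same parsing code works for any well-formed XML, whatever its content or record order.

```python
import xml.etree.ElementTree as ET

# A small, hypothetical document that complies with the XML specification.
doc = """<records>
  <record><name>temperature</name><value>98.6</value></record>
  <record><name>pulse</name><value>72</value></record>
</records>"""

root = ET.fromstring(doc)
# Generalized software: no knowledge of this particular document was
# built into the parser; it recovers whatever structure the file defines.
data = {rec.findtext("name"): rec.findtext("value") for rec in root}
print(data)  # {'temperature': '98.6', 'pulse': '72'}
```

Because the specification (not the parser) carries the structure, rearranging or adding records requires no change to the parsing code.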
Specifications serve most of the purposes of a standard, while providing many important functions that standards typically lack (e.g., full data description, data exchange across diverse types of data sets, data merging, and semantic logic). Data specifications spare us most of the heavy baggage that comes with a standard: limited flexibility to accommodate changing data objects, locked-in data descriptors, licensing and other intellectual property issues, competition among rival standards within the same data domain, and bureaucratic overhead (2).
Most importantly, specifications make standards fungible. A good specification can be ported into a data standard, and a reasonably good data standard can be ported into a specification. For example, there are dozens of image formats (e.g., jpeg, png, gif, tiff). Although most of these formats did not originate in a formal standards development process, they are used by billions of individuals and have achieved the status of de facto standards. For most of us, the selection of any particular image format is inconsequential. Data scientists have access to robust image software that will convert images from one format to another.
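One reason these de facto formats interconvert so easily is that each announces itself with a short "magic number" at the start of the file. The sketch below, a hypothetical illustration rather than production code, shows how conversion software can recognize a format from its leading bytes alone, without trusting file extensions.

```python
# Signature bytes for some common de facto image formats.
SIGNATURES = {
    b"\x89PNG\r\n\x1a\n": "png",
    b"\xff\xd8\xff": "jpeg",
    b"GIF87a": "gif",
    b"GIF89a": "gif",
    b"II*\x00": "tiff",   # little-endian TIFF
    b"MM\x00*": "tiff",   # big-endian TIFF
}

def sniff_format(header: bytes) -> str:
    """Return the image format implied by the file's leading bytes."""
    for magic, fmt in SIGNATURES.items():
        if header.startswith(magic):
            return fmt
    return "unknown"

print(sniff_format(b"\x89PNG\r\n\x1a\n" + b"\x00" * 8))  # png
print(sniff_format(b"GIF89a" + b"\x00" * 8))             # gif
```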
A common mistake committed by data scientists is to convert all their data (legacy data and newly acquired data) into a contemporary standard, and then to rely on analytic software designed to operate exclusively upon the chosen standard. Doing so only perpetuates their frustrations. You can be certain that your data standard and your software application will be unsuitable for the next generation of data scientists. It makes much more sense to port data into a general specification, from which the data can be ported to any current or future data standard.
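As a minimal sketch of this approach: keep the data in a general form, such as subject-predicate-object triples (the form underlying RDF), and generate whatever target format is needed on demand. The sample triples and the two toy emitters below are hypothetical; real Turtle, for instance, uses full IRIs rather than bare names.

```python
# Data held in a general specification-like form: triples.
triples = [
    ("sample_001", "temperature", "98.6"),
    ("sample_001", "pulse", "72"),
]

def to_turtle(triples):
    # Simplified Turtle-style serialization of the same triples.
    return "\n".join(f":{s} :{p} \"{o}\" ." for s, p, o in triples)

def to_xml(triples):
    # The same triples rendered as generic XML.
    rows = "".join(
        f'<triple s="{s}" p="{p}" o="{o}"/>' for s, p, o in triples
    )
    return f"<triples>{rows}</triples>"

print(to_turtle(triples))
print(to_xml(triples))
```

When the next standard arrives, only a new emitter is needed; the underlying data never has to be converted and re-converted.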
1. National Committee on Vital and Health Statistics. Report to the Secretary of the U.S. Department of Health and Human Services on Uniform Data Standards for Patient Medical Record Information. July 6, 2000. Available from: http://www.ncvhs.hhs.gov/hipaa000706.pdf
2. Berman JJ. Repurposing Legacy Data: Innovative Case Studies. Morgan Kaufmann, Waltham, MA, 2015.
- Jules Berman
key words: computer science, data analysis, data repurposing, data simplification, data wrangling, information science, simplifying data, taming data, complexity, standards, specifications, semantic web, jules j berman