Tuesday, March 8, 2016

DATA SIMPLIFICATION: Substandard Standards

Over the next few weeks, I will be writing on topics related to my latest book, Data Simplification: Taming Information With Open Source Tools (release date March 23, 2016). I hope I can convince you that this is a book worth reading.

Blog readers can use the discount code COMP315 at checkout for a 30% discount.

"The nice thing about standards is that you have so many to choose from." -Andrew S. Tanenbaum

Data standards are the false gods of informatics. They promise miracles, but they can't deliver. The biggest drawback of standards is that they change all the time. If you take the time to read some of the computer literature from the 1970s or 1980s, you will come across the names of standards that have long since fallen into well-deserved obscurity. You may find that the literature from the 1970s is nearly impossible to read with any level of comprehension, due to the large number of now-obsolete standards-related acronyms scattered through every page. Today's eternal standard is tomorrow's indecipherable gibberish (1).

The Open Systems Interconnection (OSI) was a networking protocol created in 1977 with approval from the International Organization for Standardization. It has been supplanted by TCP/IP, the protocol that everyone uses today. A handful of programming languages have been recognized as standards by the American National Standards Institute. These include Basic, C, Ada and Mumps. Basic and C are still popular languages. Ada, recommended by the Federal Government, back in 1995, as the language for all high performance software applications, is virtually forgotten (2). Mumps is still in use, particularly in hospital information systems, but it changed its name to M, lost its allure to a new generation of programmers, and now comes in various implementations that may not strictly conform to the original standard.

In many cases, as a standard matures, it becomes hopelessly complex. As the complexity becomes unmanageable, those who profess to use the standard may develop their own idiosyncratic implementations. Organizations that produce standards seldom provide a mechanism to ensure that the standard is implemented correctly. Standards have long been plagued by non-compliance or (more frequently) under-compliance. Over time, so-called standard-compliant systems tend to become incompatible with one another. The net result is that legacy data, purported to conform to a standard format, is no longer understandable.

Regarding versioning, it is a very good rule of thumb that when you encounter a standard whose name includes a version number (e.g. International Classification of Diseases-10, Diagnostic and Statistical Manual of Mental Disorders-5), you can be certain that the standard is unstable, and must be continually revised. Some continuously revised standards cling tenaciously to life, when they really deserve to die. In some cases, a poor standard is kept alive indefinitely by influential leaders in their fields, or by entities who have an economic stake in perpetuating the standard.

Raymond Kammer, then Director of the U.S. National Institute of Standards and Technology, understood the down-side of standards. In a year 2000 government report, he wrote that "the consequences of standards can be negative. For example, companies-and nations-can use standards to disadvantage competitors. Embodied in national regulations, standards can be crafted to impede export access, sometimes necessitating excessive testing and even redesigns of products. A 1999 survey by the National Association of Manufacturers reported that about half of U.S. small manufacturers find international standards or product certification requirements to be barriers to trade. And according to the Transatlantic Business Dialogue, differing requirements add more than 10% to the cost of car design and development." (3)

As it happens, data standards are seldom, if ever, implemented properly. In some cases, the standards are simply too complex to comprehend. Try as implementers might, every implementation of a complex standard is somewhat idiosyncratic. Consequently, no two implementations of a complex data standard are equivalent to one another. In many cases, corporations and government agencies will purposefully veer from the standard to accommodate some local exigency. In some cases, a corporation may find it prudent to include non-standard embellishments to a standard to create products or functionalities that cannot be easily reproduced by their competitors. In such cases, customers accustomed to a particular manufacturer's rendition of a standard may find it impossible to switch providers.

The process of developing new standards is costly. Interested parties must send representatives to many meetings. In the case of international standards, meetings occur in locations throughout the planet. Someone must pay for the expertise required to develop the standard, improve drafts, and vet the final version. Standards development agencies become involved in the process, and the end-product must be shepherded through one of the agencies that confer final approval. After a standard is approved, it must be accepted by its intended community of users. Educating a community in the use of a standard is another expense. In some cases, an approved standard never gains traction. Because standards cost a great deal of money to develop, it is only natural that corporate sponsors play a major role in the development and deployment of new standards. Software vendors are clever and have learned to benefit from the standards-making process. In some cases, members of a standards committee may knowingly insert a fragment of their own patented property into the standard. After the standard is released and implemented, in many different vendor systems, the patent holder rises to assert the hidden patent. In this case, all those who implemented the standard may find themselves required to pay a royalty for the use of intellectual property sequestered within the standard (4).

Corporations can profit from standards by obtaining patents on the uses of the standard; not on the standard itself. For example, an open standard may have been created that can be obtained at no cost, that is popular among its intended users, and that contains no hidden intellectual property. An interested corporation or individual may discover a use for the standard that is non-obvious, novel, and useful; these are the three criteria for awarding patents. The corporation or individual can patent the use of the standard, without needing to patent the standard itself. The patent holder will have the legal right to assert the patent over anyone who uses the standard for the purpose claimed by the patent. This patent protection will apply even when the standard is free and open (4).

Despite the problems inherent in standards, government committees cling to standards as the best way to share data. The perception is that in the absence of standards, the necessary activities of data sharing, data verification, data analysis, and any meaningful validation of the conclusions will be impossible to achieve (5). This long-held perception may not be true. Data standards, intended to simplify our ability to understand and share data, may have increased the complexity of data science. As each new standard is born, our ability to understand our data seems to diminish. Luckily, many of the problems produced by the proliferation of data standards can be avoided by switching to a data annotation technique broadly known as "specification." Although the terms "specification" and "standard" are used interchangeably by the incognoscenti, the two terms are quite different from one another. A specification is a formal way of describing data. A standard is a set of requirements, created by a standards development organization, that comprises a pre-determined content and format for a set of data.

More on specifications in tomorrow's blog.


[1] Berman JJ. Repurposing Legacy Data: Innovative Case Studies. Morgan Kaufmann, Waltham, MA, 2015.

[2] FIPS PUB 119-1. Supersedes FIPS PUB 119. 1985 November 8. Federal Information Processing Standards Publication 119-1 1995 March 13. Announcing the Standard for ADA. Available from: http://www.itl.nist.gov/fipspubs/fip119-1.htm, viewed August 26, 2012.

[3] Kammer RG. The Role of Standards in Today's Society and in the Future. Statement of Raymond G. Kammer, Director, National Institute of Standards and Technology, Technology Administration, Department of Commerce, Before the House Committee on Science Subcommittee on Technology, September 13, 2000.

[4] Berman JJ. Principles of Big Data: Preparing, Sharing, and Analyzing Complex Information. Morgan Kaufmann, Waltham, MA, 2013.

[5] National Committee on Vital and Health Statistics. Report to the Secretary of the U.S. Department of Health and Human Services on Uniform Data Standards for Patient Medical Record Information. July 6, 2000. Available from: http://www.ncvhs.hhs.gov/hipaa000706.pdf

- Jules Berman

key words: computer science, data analysis, data repurposing, data simplification, data wrangling, information science, simplifying data, taming data, complexity, standards, specifications, jules j berman