Saturday, January 30, 2010

One-way hash: Perl, Python, Ruby

I have prepared short scripts in Perl, Python, and Ruby for implementing one-way hash operations. One-way hashes are extremely important in medical informatics. The following text is extracted from a public domain document that I wrote in 2002 (1).

A one-way hash is an algorithm that transforms a string into another string in such a way that the original string cannot be calculated by operations on the hash value (hence the term "one-way" hash). Examples of public domain one-way hash algorithms are MD5 and SHA (Secure Hash Algorithm). These differ from encryption protocols, which produce an output that can be decrypted by a second computation on the encrypted string.
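As a minimal sketch of the idea (not the scripts mentioned in this post), Python's standard hashlib module can compute MD5 and SHA digests of a string. The function name one_way_hash and the sample identifier are made up for illustration.

```python
import hashlib


def one_way_hash(text: str, algorithm: str = "sha256") -> str:
    """Return the hexadecimal digest of `text` under the named algorithm."""
    # hashlib.new accepts algorithm names such as "md5", "sha1", "sha256".
    return hashlib.new(algorithm, text.encode("utf-8")).hexdigest()


if __name__ == "__main__":
    identifier = "Thomas Peterson"  # made-up identifier, for illustration only
    for name in ("md5", "sha1", "sha256"):
        print(name, one_way_hash(identifier, name))
```

The same input always yields the same digest, and the digest length depends only on the algorithm (32 hexadecimal characters for MD5, 64 for SHA-256), not on the length of the input.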

The resultant one-way hash values for text strings consist of near-random strings of characters, and the length of the strings (i.e., the strength of the one-way hash) can be made arbitrarily large. Therefore name spaces for one-way hashes can be so large that the chance of hash collisions (two different names or identifiers hashing to the same value) is negligible. For the fussy among us, protocols can be implemented guaranteeing a data set free of hash collisions, but such protocols may place restrictions upon the design of the data set (e.g., precluding the accrual of records to the data set after a certain moment).
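To put a rough number on "negligible," the standard birthday-bound approximation can be evaluated directly. This is only an estimate, and it assumes the hash output behaves like a uniformly random value; the function name collision_probability is hypothetical.

```python
import math


def collision_probability(n_items: int, hash_bits: int) -> float:
    """Approximate chance that any two of n_items share a hash value.

    Uses the birthday-bound approximation
        p ~= 1 - exp(-n * (n - 1) / 2**(bits + 1)),
    which assumes hash outputs are uniformly distributed.
    """
    exponent = -n_items * (n_items - 1) / 2 ** (hash_bits + 1)
    # expm1 keeps precision when the probability is extremely small.
    return -math.expm1(exponent)


if __name__ == "__main__":
    for bits in (32, 128, 256):
        print(bits, collision_probability(10**9, bits))
```

Under this approximation, a billion identifiers hashed to 32 bits are all but certain to collide, while the same billion hashed to 128 or 256 bits collide with a probability far too small to matter in practice.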

In theory, one-way hashes can be used to anonymize patient records while still permitting researchers to accrue data over time to a specific patient's record. If a patient returns to the hospital and has an additional procedure performed, the record identifier, when hashed, will produce the same hash value held by the original data set record. The investigator simply adds the data to the "anonymous" data set record containing the same one-way hash value. Since no identifier in the experimental data set record can be used to link back to the patient, the requirements for anonymization, as stipulated in the E4 exemption, are satisfied (vide supra).
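The accrual step can be sketched in a few lines. This is an illustration under stated assumptions, not a de-identification system: the identifier "MRN-0012345", the procedure names, and the names hash_identifier, add_record, and research_data are all hypothetical, and SHA-256 is simply one convenient choice of SHA variant.

```python
import hashlib
from collections import defaultdict


def hash_identifier(identifier: str) -> str:
    """One-way hash of a record identifier (SHA-256, hex-encoded)."""
    return hashlib.sha256(identifier.encode("utf-8")).hexdigest()


# The research data set is keyed by hash value only; no identifier is stored.
research_data = defaultdict(list)


def add_record(identifier: str, procedure: str) -> str:
    """File a procedure under the hash of the identifier and return that hash."""
    key = hash_identifier(identifier)
    research_data[key].append(procedure)
    return key


# Two visits by the same (made-up) patient land in the same record.
first_visit = add_record("MRN-0012345", "biopsy")
second_visit = add_record("MRN-0012345", "resection")
```

Because the same identifier always hashes to the same value, the second visit is appended to the record created by the first, and the data set itself never holds the identifier.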

There is no practical algorithm that can take an SHA hash and determine the name (or the social security number or the hospital identifier, or any combination of the above) that was used to produce the hash string. In France, the name-hashed files are merged with files from many different hospitals and used in epidemiologic research. They use the hash-codes to link patient-data across hospitals. Their methods have been registered with SCSSI (Service Central de la Securite des Systemes d'information).

Implementation of one-way hashes creates some practical problems. Attacks on one-way hash data may take the form of hashing a list of names and looking for matching hash values in the data set. This can be countered by encrypting the hash, by hashing a secret combination of identifier elements, by doing both, or by keeping the hash value private (hidden). Issues also arise from the multiple ways that a person may be identified within a hospital system (Tom Peterson on Monday, Thomas Peterson on Tuesday), all resulting in inconsistent hashes for a single person. Resolving these problems is an interesting area for further research.
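Both problems are easy to reproduce. The sketch below runs the name-list attack against plain hashes, then repeats it against keyed hashes. The keyed hash uses HMAC from Python's standard library as a stand-in for "hashing a secret combination"; that is my substitution for illustration, not a method prescribed in the text, and the secret value and names are made up.

```python
import hashlib
import hmac


def plain_hash(name: str) -> str:
    """Unkeyed SHA-256 hash of a name."""
    return hashlib.sha256(name.encode("utf-8")).hexdigest()


def keyed_hash(name: str, secret: bytes) -> str:
    """HMAC-SHA-256 of a name; useless to an attacker who lacks `secret`."""
    return hmac.new(secret, name.encode("utf-8"), hashlib.sha256).hexdigest()


def dictionary_attack(hashed_values, candidate_names):
    """Return {hash: name} for each candidate whose plain hash is in the data set."""
    lookup = {plain_hash(n): n for n in candidate_names}
    return {h: lookup[h] for h in hashed_values if h in lookup}


SECRET = b"hypothetical-secret-held-by-the-hospital"  # assumption for the demo
candidates = ["Tom Peterson", "Thomas Peterson", "Mary Jones"]

plain_set = {plain_hash("Thomas Peterson")}
keyed_set = {keyed_hash("Thomas Peterson", SECRET)}

recovered_plain = dictionary_attack(plain_set, candidates)  # finds the name
recovered_keyed = dictionary_attack(keyed_set, candidates)  # finds nothing

# The name-variant problem: two spellings of one person give unrelated hashes.
variants_match = plain_hash("Tom Peterson") == plain_hash("Thomas Peterson")
```

The attack recovers the name from the plain hash but not from the keyed one, since the attacker cannot reproduce the keyed hash without the secret. The last line shows the second problem: the two spellings of the same person do not match, so their records would not be linked.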

The scripts are available at:

1. Berman JJ. Confidentiality for Medical Data Miners. Artificial Intelligence in Medicine 26:25-36, 2002.

- Jules Berman

My book, Principles of Big Data: Preparing, Sharing, and Analyzing Complex Information, was published in 2013 by Morgan Kaufmann.

I urge you to read more about my book. Google Books has prepared a generous preview of the book contents. If you like the book, please ask your librarian to purchase a copy for your library or reading room.

tags: big data, metadata, data preparation, data analytics, data repurposing, datamining, data mining, encryption, security, informatics, perl programming, ruby programming, python programming, jules j berman, md5, sha, 1-way hash, 1way hash, oneway hash, confidentiality, de-identification