Sunday, January 6, 2008

Deidentification with one-way hash algorithms

A one-way hash is an algorithm that transforms a string into another string in such a way that the original string cannot be calculated by operations on the hash value (hence the term "one-way" hash). Examples of public domain one-way hash algorithms are MD5 and SHA (Secure Hash Algorithm) [1,2]. These differ from encryption protocols, which produce an output that can be decrypted by a second computation on the encrypted string.
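As a minimal sketch of what this looks like in practice, the following uses Python's standard hashlib module to hash a text identifier. The function name one_way_hash and the sample names are illustrative assumptions, and SHA-256 is used as the default only because it is a widely available member of the SHA family; it is not necessarily the variant used in the protocols cited here.

```python
import hashlib

def one_way_hash(identifier: str, algorithm: str = "sha256") -> str:
    """Return the hexadecimal digest of a text identifier."""
    return hashlib.new(algorithm, identifier.encode("utf-8")).hexdigest()

# The same input always produces the same digest.
print(one_way_hash("Thomas Peterson"))

# A slightly different input produces an unrelated-looking digest.
print(one_way_hash("Tom Peterson"))

# MD5 yields a 128-bit digest (32 hex characters);
# SHA-256 yields a 256-bit digest (64 hex characters).
print(one_way_hash("Thomas Peterson", "md5"))
```

Note that hashlib offers no function for going from a digest back to the input; that absence is the "one-way" property, in contrast to an encryption library, which always pairs encrypt with decrypt.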

The resultant one-way hash values for text strings consist of near-random strings of characters, and the length of the hash values (i.e., the strength of the one-way hash) can be made arbitrarily large. Therefore, the name spaces for one-way hashes can be so large that the chance of hash collisions (two different names or identifiers hashing to the same value) is negligible. For the fussy among us, protocols can be implemented guaranteeing a dataset free of hash collisions, but such protocols may place restrictions upon the design of the dataset (e.g., precluding the accrual of records to the dataset after a certain moment).
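To put a rough number on "negligible", the sketch below applies the standard birthday-bound approximation, p ≈ 1 - exp(-n(n-1)/2N), where n is the number of distinct identifiers and N is the size of the hash space. It assumes that hash outputs are uniformly distributed, which is an idealization; the function name collision_probability is an illustrative choice.

```python
import math

def collision_probability(n_records: int, hash_bits: int) -> float:
    """Approximate chance that at least two of n_records distinct inputs
    share a hash value, assuming uniformly distributed hash outputs."""
    space = 2 ** hash_bits
    exponent = -n_records * (n_records - 1) / (2 * space)
    # -expm1(x) equals 1 - exp(x) but stays accurate when x is tiny.
    return -math.expm1(exponent)

# One billion identifiers in a 128-bit (MD5-sized) hash space.
print(collision_probability(10**9, 128))

# The same billion identifiers in a 256-bit hash space.
print(collision_probability(10**9, 256))

# A deliberately tiny 16-bit space with only 1,000 identifiers.
print(collision_probability(1000, 16))
```

The last call illustrates the other side of the argument: when the hash space is small relative to the number of records, collisions become nearly certain.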

In theory, one-way hashes can be used to anonymize patient records while still permitting researchers to accrue data over time to a specific patient's record. If a patient returns to the hospital and has an additional procedure performed, the record identifier, when hashed, will produce the same hash value held by the original dataset record. The investigator simply adds the data to the "anonymous" dataset record containing the same one-way hash value. Since no identifier in the experimental dataset record can be used to link back to the patient, the requirements for anonymization, as stipulated in the E4 exemption, are satisfied (vide supra).
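The accrual step can be sketched in a few lines. Everything named here is a hypothetical illustration: the record identifiers (MRN-...), the procedures, and the in-memory dictionary standing in for a research dataset. A bare hash of the identifier is used only to keep the sketch short; the attacks discussed later in this article explain why a real system would need more than that.

```python
import hashlib
from collections import defaultdict

def hash_identifier(patient_id: str) -> str:
    """One-way hash of a hospital record identifier."""
    return hashlib.sha256(patient_id.encode("utf-8")).hexdigest()

# Research dataset: hashed identifier -> list of de-identified observations.
research_dataset = defaultdict(list)

def add_encounter(patient_id: str, observation: str) -> None:
    """File an observation under the hash of the identifier, never the identifier itself."""
    research_dataset[hash_identifier(patient_id)].append(observation)

add_encounter("MRN-0012345", "appendectomy")
add_encounter("MRN-0067890", "colonoscopy")
# The first patient returns later; the same identifier hashes to the same key.
add_encounter("MRN-0012345", "hernia repair")

for hashed_id, observations in research_dataset.items():
    print(hashed_id[:12], observations)
```

The dataset ends up with two records, one of which holds both procedures for the returning patient, and neither record contains the original identifier.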

The use of one-way hashes to anonymize patient records has been employed and promoted in France. Quantin and Bouzelat have standardized a protocol for coding names using SHA one-way hashes [3]. There is no practical algorithm that can take an SHA hash and determine the name (or the social security number or the hospital identifier, or any combination of the above) that was used to produce the hash string. In France, name-hashed files from many different hospitals are merged and used in epidemiologic research; the hash codes serve to link patient data across hospitals.

Implementation of one-way hashes carries certain practical problems. Attacks on one-way hash data may take the form of hashing a list of names and looking for matching hash values in the dataset. This can be solved by encrypting the hash, by hashing a secret combination of identifier elements, by keeping the hash value private (hidden), or by some combination of these. Issues also arise from the multiple ways that a person may be identified within a hospital system (Tom Peterson on Monday, Thomas Peterson on Tuesday), all resulting in inconsistent hashes for a single person. Resolving these problems is an interesting area for further research.
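Both problems can be demonstrated in a short sketch. The attack below hashes a list of candidate names and looks for matches. The defense shown is a keyed hash (HMAC) from Python's standard hmac module, which is one common way to mix a secret into the hash; it is a stand-in for, not a reproduction of, the "secret combination of identifier elements" described above. The key, the names, and the normalize helper are all hypothetical.

```python
import hashlib
import hmac

def plain_hash(name: str) -> str:
    """Unkeyed SHA-256 hash of a name."""
    return hashlib.sha256(name.encode("utf-8")).hexdigest()

def keyed_hash(name: str, secret_key: bytes) -> str:
    """HMAC-SHA-256 of a name; useless to an attacker who lacks the key."""
    return hmac.new(secret_key, name.encode("utf-8"), hashlib.sha256).hexdigest()

def dictionary_attack(dataset_hashes: set, candidate_names: list) -> dict:
    """Hash every candidate name and report those found in the dataset."""
    return {name: plain_hash(name)
            for name in candidate_names
            if plain_hash(name) in dataset_hashes}

def normalize(name: str) -> str:
    """Crude normalization of case and whitespace only."""
    return " ".join(name.split()).upper()

SECRET_KEY = b"hypothetical-key-held-by-a-trusted-party"
candidates = ["Mary Jones", "Thomas Peterson", "Tom Peterson"]

published_plain = {plain_hash("Thomas Peterson")}
published_keyed = {keyed_hash("Thomas Peterson", SECRET_KEY)}

# Against unkeyed hashes, the attacker recovers the name.
print(dictionary_attack(published_plain, candidates))
# Against keyed hashes, the same attack finds nothing.
print(dictionary_attack(published_keyed, candidates))

# Normalization fixes case and spacing, but not name variants.
print(plain_hash(normalize("tom  peterson")) == plain_hash(normalize("Tom Peterson")))
print(plain_hash(normalize("Tom Peterson")) == plain_hash(normalize("Thomas Peterson")))
```

The keyed hash shifts the burden to protecting the key, and simple normalization leaves the Tom/Thomas problem untouched, which is consistent with the point that these issues remain open.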

1. R. Rivest. The MD5 Message-Digest Algorithm. Request for Comments 1321.
http://theory.lcs.mit.edu/~rivest/Rivest-MD5.txt

2. World Wide Web Consortium. SHA-1 Digest.
http://www.w3.org/TR/1998/REC-DSig-label/SHA1-1_0

3. H. Bouzelat, C. Quantin, L. Dusserre. Extraction and anonymity protocol of medical file. Proc AMIA Annu Fall Symp (1996) 323-327.

See also my article on one-way hash issues under HIPAA.

-Jules J. Berman
