Most de-identification software for medical records does not actually de-identify records; it simply reduces the number of identifiers they contain. This goes far to explain the complete absence of de-identified, public domain medical record datasets prepared by automatic de-identifiers and made available on the Web.
A large corpus of so-called de-identified records is certain to contain some HIPAA identifiers. In the U.S., the language in HIPAA would indicate that if you're reasonably certain that the data cannot be used to identify a protected individual, the data is exempted from HIPAA restrictions. But if you start off knowing that the data contains some HIPAA identifiers, you lose the reasonable expectation that no patient can be re-identified from the data!
What, then, is the value of automatic de-identifiers that do not fully de-identify? In the U.S., HIPAA permits two methods by which an IRB can allow data that is not fully de-identified to be used for a variety of purposes. The first is the Waiver, which permits the IRB to allow use of the data if certain conditions are met. The second is the Limited Use agreement, which permits a specified partner to receive data that is not fully de-identified, under certain conditions. The Common Rule, which applies to human subject research in the U.S., permits IRB Waivers under a very similar set of conditions.
Even when the corpus of records is shared under a Waiver or a Limited Use agreement, the data must conform to the Minimum Necessary provision in HIPAA. When identified information is used for permitted purposes, HIPAA requires that only the minimum amount of information needed for the purpose be disclosed (see the HIPAA excerpt, below). This would imply that information included in medical reports but unrelated to the research goals must be removed before the reports are transferred to external covered entities. In other words, the Minimum Necessary provision applies to more than identifying information.
Section 164.514(d), Minimum Necessary: "covered entities must make reasonable efforts to use or disclose or to request from another covered entity, only the minimum amount of protected health information required to achieve the purpose of a particular use or disclosure."
Most de-identifier software cannot, in any way, help a data holder comply with the Minimum Necessary provision. The Concept-Match method is an exception. It blocks all text except for phrases that match terms contained in a medical nomenclature (such as the UMLS) and high frequency words (when, if, can, are, the, etc.), so that what remains is close to the "Minimum Necessary". To a somewhat lesser extent, the "doublet method" will do the same.
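To make the idea concrete, here is a minimal Python sketch of a Concept-Match style scrubber. It is an illustration only, not the published implementation: the function name, the tiny nomenclature set, the stop-word set, and the use of "*" as the blocking character are all assumptions made for the example. A real run would load its terms from a full nomenclature such as the UMLS.

```python
import re

# Hypothetical stand-ins for illustration; a real run would load these
# from a full nomenclature (such as the UMLS) and a longer stop-word list.
NOMENCLATURE = {"adenocarcinoma of the colon", "adenocarcinoma", "colon",
                "metastasis", "liver"}
STOP_WORDS = {"when", "if", "can", "are", "the", "of", "was", "with",
              "to", "a", "is", "in", "and", "at"}


def concept_match_scrub(text, nomenclature=NOMENCLATURE,
                        stop_words=STOP_WORDS, blocked="*"):
    """Keep nomenclature phrases and high frequency words; block the rest."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    # Longest nomenclature term, in words, bounds the phrase window.
    max_len = max((len(term.split()) for term in nomenclature), default=1)
    kept = []
    i = 0
    while i < len(words):
        matched = False
        # Try the longest candidate phrase first, then shorter ones.
        for n in range(min(max_len, len(words) - i), 0, -1):
            phrase = " ".join(words[i:i + n])
            if phrase in nomenclature:
                kept.append(phrase)
                i += n
                matched = True
                break
        if matched:
            continue
        # A lone word survives only if it is a high frequency word.
        kept.append(words[i] if words[i] in stop_words else blocked)
        i += 1
    return " ".join(kept)


if __name__ == "__main__":
    report = ("Mr. John Smith was seen at Mercy Hospital "
              "with adenocarcinoma of the colon.")
    print(concept_match_scrub(report))
```

In this sketch, names, institutions, and any other word that is neither a nomenclature term nor a high frequency word are blocked by default. Nothing has to be recognized as an identifier in order to be removed, which is why the approach also strips material that is merely unrelated to the medical content.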
The strengths and limitations of the various types of de-identifiers now available as open source software are discussed in Biomedical Informatics and in Ruby Programming for Medicine and Biology.
My book, Principles of Big Data: Preparing, Sharing, and Analyzing Complex Information was published in 2013 by Morgan Kaufmann.
I urge you to explore my book. Google Books has prepared a generous preview of the book contents.
tags: big data, metadata, data preparation, data analytics, data repurposing, datamining, data mining, deidentification, hipaa, minimum necessary