I have just uploaded the paper that fully describes the concept-match method for medical record de-identification. This version is modified from the original publication: its URLs have been updated so that they link to currently available supplementary resources.
The properties of the concept-match method are:
It produces an output devoid of phrases that do not map to a reference terminology.
It substitutes synonymous medical terms for the original terms contained in the text. This makes it difficult for someone with access to diagnostic terms found in the original report to match them against text in the output record, which is another type of attack on confidentiality.
It maintains the original order of terms in sentences and preserves standard stop words. Because word order is kept intact, readers (and computer parsers) of scrubbed text can reconstruct the grammatical (logical) relationships between output terms in scrubbed sentences.
It provides an output stripped of nonmedical and extraneous information, in keeping with HIPAA recommendations that covered entities restrict transfers of medical information to the minimum necessary to accomplish its purpose.
It provides the terminology code for each medical term included in the sentence, making it possible to index terms and to relate terms to ancestor and descendant terms listed in biomedical ontologies.
It does its job quickly, which matters because high-throughput techniques are required to handle large volumes of data.
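To make the properties above concrete, here is a minimal Python sketch of the general idea. It is not the implementation from the paper. The tiny TERMINOLOGY table, its codes, the stop-word list, and the use of "*" to mark removed words are all hypothetical, chosen only for illustration; a real run would use a full reference terminology.

```python
import re

# Hypothetical toy terminology: phrase -> (made-up code, synonymous term).
# A real run would load a full reference terminology instead.
TERMINOLOGY = {
    "drug induced hepatitis": ("CODE0001", "drug-related liver inflammation"),
    "hepatitis": ("CODE0002", "liver inflammation"),
    "liver": ("CODE0003", "hepatic tissue"),
}

# Hypothetical short stop-word list; these words are passed through unchanged.
STOP_WORDS = {"the", "a", "an", "of", "in", "and", "with", "was", "is", "to", "for"}

# Longest phrase length (in words) found in the terminology.
MAX_PHRASE_WORDS = max(len(phrase.split()) for phrase in TERMINOLOGY)


def scrub_sentence(sentence):
    """Return the sentence with only stop words and coded terms retained."""
    words = re.findall(r"[a-z0-9]+", sentence.lower())
    output = []
    i = 0
    while i < len(words):
        # Try the longest candidate phrase first, then shorter ones.
        for n in range(min(MAX_PHRASE_WORDS, len(words) - i), 0, -1):
            phrase = " ".join(words[i:i + n])
            if phrase in TERMINOLOGY:
                code, synonym = TERMINOLOGY[phrase]
                output.append(f"{synonym} {code}")  # synonym plus its code
                i += n
                break
        else:
            # No terminology match starts here: keep stop words, block the rest.
            output.append(words[i] if words[i] in STOP_WORDS else "*")
            i += 1
    return " ".join(output)


print(scrub_sentence("Mr. Smith was seen for drug induced hepatitis of the liver."))
# prints: * * was * for drug-related liver inflammation CODE0001 of the hepatic tissue CODE0003
```

Note that the name in the sample sentence never reaches the output, not because it was recognized as a name, but because it does not map to the terminology. Word order and stop words survive, and each retained term carries a code that could be used for indexing.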
Also distributed with the concept-match paper is JHARCOLL, a list of text phrases from surgical pathology reports. The JHARCOLL file is freely distributed as a gzipped tarball from:
http://www.julesberman.info/jharcoll.tar.gz
It contains about 568,000 medical phrases that can be used in a variety of informatics projects.
Here is a small excerpt of consecutive phrases taken directly from the jharcoll file:
drug induced colitis
drug induced damage
drug induced disease
drug induced enteritis
drug induced erosion
drug induced esophagitis
drug induced etiology
drug induced febrile
drug induced forms
drug induced gastric injury
drug induced gastric ulcers
drug induced gastritis
drug induced gingival hypertrophy
drug induced granulomas
drug induced granulomatous
drug induced granulomatous disease
drug induced granulomatous hepatitis
drug induced gut
drug induced gut lesions
drug induced hepatic
drug induced hepatic granulomas
drug induced hepatitis
drug induced hypersensitivity reaction
drug induced immune reaction
drug induced inflammatory disease
drug induced injury
drug induced interstitial
drug induced interstitial lung disease
drug induced interstitial nephritis
drug induced interstitial nephritis clearly
drug induced intestinal inflammatory disease
drug induced intrahepatic cholestasis
drug induced lesion
drug induced lesions
drug induced liver
drug induced liver disease
drug induced liver injury
drug induced lung
drug induced lung disease
drug induced lung injury
drug induced lupus
drug induced lupus erythematosus
drug induced marrow depression
drug induced mucosal injury
drug induced myocarditis
drug induced nephritis
drug induced neutropenia
drug induced pancytopenia
drug induced process
drug induced reaction
drug induced submassive necrosis
drug induced thrombocytopenia
drug induced thrombotic
drug induced ulcer
drug induced ulceration
drug induced ulcers
drug induced vascular disease
drug induced vasculitis
drug induced veno occlusive disease
drug induced vs
drug indused
drug infusion instrument
drug ingestion
drug ingestion aside
drug ingestion history
drug injestion
drug injuries
drug injury
drug intake
drug levels
drug nephrotoxicity
drug nephrotoxicity caused
drug ointment
drug pigmentation
drug presence
drug rash
drug rash versus gvhd
drug reaction
drug reaction given
drug reaction viral exanthem
drug reaction vs
drug reaction vs gvh
drug reactions
drug reactions might
drug recently
drug regimen
drug residue
drug rx
drug rx toxicity
drug rxn
drug stress
drug therapy
drug toxic
drug toxicity
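For readers who want to work with a list like this, the following is a minimal Python sketch for loading phrases from a downloaded archive and pulling out every phrase that begins with a given prefix. It assumes the archive holds plain text with one phrase per line, as the excerpt above suggests; the function names and the local file name in the usage note are hypothetical.

```python
import bisect
import tarfile


def load_phrases(archive_path):
    """Return a sorted list of the unique phrases found in a .tar.gz archive.

    Assumes every regular file in the archive is plain text, one phrase per line.
    """
    phrases = set()
    with tarfile.open(archive_path, "r:gz") as archive:
        for member in archive:
            if not member.isfile():
                continue  # skip directories and other non-file entries
            handle = archive.extractfile(member)
            for raw_line in handle:
                phrase = raw_line.decode("utf-8", errors="replace").strip()
                if phrase:
                    phrases.add(phrase)
    return sorted(phrases)


def phrases_with_prefix(sorted_phrases, prefix):
    """Return all phrases starting with prefix, using binary search to find the first."""
    start = bisect.bisect_left(sorted_phrases, prefix)
    end = start
    while end < len(sorted_phrases) and sorted_phrases[end].startswith(prefix):
        end += 1
    return sorted_phrases[start:end]
```

After downloading the archive to a local file (here hypothetically named jharcoll.tar.gz), calling phrases_with_prefix(load_phrases("jharcoll.tar.gz"), "drug induced") would return the "drug induced" block of phrases. Sorting once and searching with bisect keeps repeated prefix lookups fast on a list of this size.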
-Jules Berman
My book, Principles of Big Data: Preparing, Sharing, and Analyzing Complex Information, was published in 2013 by Morgan Kaufmann.
I urge you to explore my book. Google Books has prepared a generous preview of the book contents.