Monday, January 14, 2008

Parsable Doublets List now available in public domain

Word doublets are two-word phrases that appear in text (i.e., they are not randomly chosen two-word sequences).
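To make the definition concrete, here is a minimal sketch that lists the adjacent word pairs in a single sentence. The sample sentence and the sub name adjacent_pairs are assumptions made for illustration; they are not part of the script that produced the list.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Return every pair of adjacent words in a string, in order of appearance.
# The sub name and the sample sentence are illustrative assumptions.
sub adjacent_pairs {
    my ($text) = @_;
    $text =~ tr/a-zA-Z //cd;                        # keep only letters and spaces
    my @words = grep { length } split(/ +/, lc $text);
    my @pairs;
    for my $i (1 .. $#words) {
        push @pairs, "$words[$i-1] $words[$i]";     # previous word + current word
    }
    return @pairs;
}

print join("\n", adjacent_pairs("The tumor invades the muscle wall.")), "\n";
```

A six-word sentence yields five pairs, each overlapping its neighbor by one word ("the tumor", "tumor invades", and so on).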

Doublets can be used in a variety of informatics projects: indexing, data scrubbing, nomenclature curation, etc. Over the next few days, I will provide examples of doublet-based informatics projects.

A list of over 200,000 word doublets is available for download.

The list was generated from a large narrative pathology text. Thus, the doublets included here would be particularly suitable for informatics projects involving surgical pathology reports, autopsy reports, pathology papers and books, and so on.
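One plausible way to use the downloaded list in a project is to load it into a hash and test whether a phrase is present. The sketch below assumes the list has one doublet per line; the three sample entries and the sub name load_doublets are stand-ins invented for illustration, not data taken from the actual file.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Load a one-doublet-per-line list into a hash for fast lookup.
sub load_doublets {
    my ($fh) = @_;
    my %known;
    while (my $line = <$fh>) {
        chomp $line;
        $known{$line} = 1 if length $line;   # skip blank lines
    }
    return \%known;
}

# A three-entry stand-in for the downloaded list (illustrative data only).
my $sample_list = "adrenal gland\nbone marrow\nlymph node\n";
open(my $fh, "<", \$sample_list) or die "Cannot open in-memory list: $!";
my $known = load_doublets($fh);
close $fh;

for my $phrase ("bone marrow", "marrow bone") {
    my $status = exists $known->{$phrase} ? "in list" : "not in list";
    print "$phrase: $status\n";
}
```

To run it against the real list, open the downloaded file instead of the in-memory string and pass that filehandle to load_doublets.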

The Perl script that generated the list of doublets by parsing a text file ("pathold.txt") is shown below:

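#!/usr/bin/perl
# The input file is named in the post; the output filename "doublets.txt" is an assumption.
open(TEXT, "<", "pathold.txt") or die "Cannot open pathold.txt: $!";
open(OUT, ">", "doublets.txt") or die "Cannot open doublets.txt: $!";
undef($/);                       # slurp mode: read the whole file as one string
$var = <TEXT>;
$var =~ s/\n/ /g;                # join lines with spaces
$var =~ s/\'s//g;                # drop possessive endings
$var =~ tr/a-zA-Z\'\- //cd;      # delete all but letters, apostrophes, hyphens, spaces
@words = split(/ +/, $var);
$oldthing = "";
foreach $thing (@words)
  {
  $doublet = "$oldthing $thing";
  if ($doublet =~ /^[a-z]+ [a-z]+$/)   # keep pairs of all-lowercase words only
    {
    $doublethash{$doublet} = "";       # hash keys store each doublet once
    }
  $oldthing = $thing;
  }
close TEXT;
@wordarray = sort(keys(%doublethash));
print OUT join("\n",@wordarray);
close OUT;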

You can generate your own list by substituting any text file you like for "pathold.txt". Keep in mind that the Perl script slurps the entire text file into a string variable, so the script will fail if the file is too large to fit in the computer's available memory. For most computers (with more than 256 MB of RAM) this will not be a problem. On my computer (about 2.8 GHz and 512 MB of RAM), the script takes about 5 seconds to parse a 9 MB text file.
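If a file really is too large to slurp, one alternative is to read it a line at a time and carry the last word of each line over to the next. The following is a sketch of that idea, not the script used to build the list; the sub name doublets_from_handle and the sample text are assumptions made for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Collect doublets from a filehandle one line at a time, so memory use
# depends on the number of distinct doublets rather than on file size.
sub doublets_from_handle {
    my ($fh) = @_;
    my %doublethash;
    my $oldthing = "";                     # last word seen, carried across lines
    while (my $line = <$fh>) {
        $line =~ s/\'s//g;                 # drop possessive endings
        $line =~ tr/a-zA-Z\'\- //cd;       # same filter as above; also removes the newline
        for my $thing (grep { length } split(/ +/, $line)) {
            my $doublet = "$oldthing $thing";
            $doublethash{$doublet} = "" if $doublet =~ /^[a-z]+ [a-z]+$/;
            $oldthing = $thing;
        }
    }
    return sort keys %doublethash;
}

# Demonstration on an in-memory "file" (illustrative text only).
my $text = "the spleen is\nenlarged and the spleen is firm\n";
open(my $fh, "<", \$text) or die "Cannot open in-memory text: $!";
print join("\n", doublets_from_handle($fh)), "\n";
close $fh;
```

Because the previous word survives the end of each line, a doublet that straddles a line break ("is enlarged" in the sample) is still captured, just as it would be after the newlines are replaced with spaces in the slurping version.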

Since the doublet list below consists of a non-narrative collection of words, it cannot be copyrighted (i.e., it is distributed as a public domain file).

-Jules Berman