In today's blog, I'm posting a Perl script that extracts, from a large nomenclature, the terms that cannot be composed from doublets contained in other terms (i.e., the terms that must include unique doublets).
The key lines in the script are (that create the list of doublets) are shown here:
foreach $thing (@words)
{
$doublet = "$oldthing $thing";
if ($doublet =~ /^[a-z]+ [a-z]+$/)
{
$doublethash{$doublet} =
$doublethash{$doublet} + 1;
}
$oldthing = $thing;
}
These lines are used in virtually every Perl script that uses the doublet method. Basically, they move through an array consisting of the consecutive words in a nomenclature term, two words at a time, creating a new doublet and a new member of a doublet hash structure, with each loop. If you know Perl, this little piece of code should be easy to understand.
The entire Perl script follows here. As with all my posted scripts, the software is provided "as is", without warranty of any kind, express or implied, including but not limited to the warranties of merchantability, fitness for a particular purpose and noninfringement. in no event shall the authors or copyright holders be liable for any claim, damages or other liability, whether in an action of contract, tort or otherwise, arising from, out of or in connection with the software or the use or other dealings in the software.
#!/usr/local/bin/perl
open(TEXT,"neocl.xml")||die"cannot";
open(OUT,">dubuniq.txt")||die"cannot";
$line = " ";
while ($line ne "")
{
$line = <TEXT>;
next if ($line !~ /\"(C[0-9]{7})\"/);
next if ($line !~ /\"\> ?(.+) ?\<\//);
$line =~ /\"\> ?(.+) ?\<\//;
$phrase = $1;
@words = split(/ /, $phrase);
foreach $thing (@words)
{
$doublet = "$oldthing $thing";
if ($doublet =~ /^[a-z]+ [a-z]+$/)
{
$doublethash{$doublet} =
$doublethash{$doublet} + 1;
}
$oldthing = $thing;
}
}
close TEXT;
open(TEXT,"neocl.xml")||die"cannot";
$phrase = "";
$line = " ";
$count = 0;
while ($line ne "")
{
$oldthing = "";
$rightflank = "";
$leftflank = "";
$line = <TEXT>;
next if ($line !~ /\"(C[0-9]{7})\"/);
next if ($line =~ /\"C0000000\"/);
next if ($line =~ /\"C0000001\"/);
next if ($line !~ /\"\> ?(.+) ?\<\//);
$line =~ /\"\> ?(.+) ?\<\//;
$phrase = $1;
@words = split(/ /, $phrase);
next if (scalar(@words) < 3);
foreach $thing (@words)
{
$newdoublet = "$oldthing $thing";
if ($newdoublet =~ /^[a-z]+ [a-z]+$/)
{
if (exists($doublethash{$newdoublet}))
{
if ($doublethash{$newdoublet} == 1)
{
if ($phrase =~ /[a-z]+ $oldthing/)
{
$leftflank = $&;
}
if ($phrase =~ /$thing [a-z]+/)
{
$rightflank = $&;
}
unless ($doublethash{$leftflank} > 1
&& $doublethash{$rightflank} > 1)
{
$uniqphrase{$phrase} = "";
}
}
}
}
$oldthing = $thing;
}
}
while ((my $key, my $value) = each(%uniqphrase))
{
$count++;
print OUT "$count $key\n";
}
exit;
The output consists of a file composed of a list of terms, one line per term, that cannot be constructed from doublets found in other terms. This output was discussed in a prior blog. A sample list of doublets is available for download.
- Jules Berman
My book, Principles of Big Data: Preparing, Sharing, and Analyzing Complex Information was published in 2013 by Morgan Kaufmann.
I urge you to read more about my book. Google books has prepared a generous preview of the book contents. If you like the book, please request your librarian to purchase a copy of this book for your library or reading room.
tags: big data, metadata, data preparation, data analytics, data repurposing, datamining, data mining, doublet method, medical linguistics, medical algorithm, nomenclature

 
