Saturday, March 12, 2016

DATA SIMPLIFICATION: Building Word Lists


Over the next few weeks, I will be writing on topics related to my latest book, Data Simplification: Taming Information With Open Source Tools (release date March 23, 2016). I hope I can convince you that this is a book worth reading.


Blog readers can use the discount code COMP315 at checkout for a 30% discount.


Word lists are easy to create for just about any written language that has an electronic literature. Here is a short Python script, words.py, that prompts the user to enter a line of text. The script converts the line to lowercase, removes the newline at the end of the line, splits the result into terms, removes duplicate terms, sorts the terms alphabetically, and prints the list, with one term on each line of output. This words.py script can be easily modified to create word lists from plain-text files (See Glossary item, Metasyntactic variable).
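#!/usr/local/bin/python
# words.py: parse one line of input into a sorted list of unique terms
import re
import sys

print("Enter a line of text to be parsed into a word list")
line = sys.stdin.readline()
# Lowercase the line and strip the trailing newline and whitespace
line = line.lower().rstrip()
# Split on runs of spaces, drop duplicates, and alphabetize
linearray = sorted(set(re.split(r' +', line)))
for term in linearray:
    print(term)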
Here is a sample of output, when the input joins the closing words of Joyce's Finnegans Wake to its opening line:
c:\ftp>words.py
Enter a line of text to be parsed into a word list

a way a lone a last a loved a long the riverrun, past Eve and Adam's, from 
swerve of shore to bend of bay, brings us by a commodius vicus

a
adam's,
and
bay,
bend
brings
by
commodius
eve
from
last
lone
long
loved
of
past
riverrun,
shore
swerve
the
to
us
vicus
way
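
Notice that the sample output keeps punctuation attached to some terms ("adam's," and "bay,"). Here is a minimal sketch of a variant that strips surrounding punctuation and can also read its text from a plain-text file. The function names word_list and word_list_from_file, and the choice to keep internal apostrophes and hyphens, are my own assumptions for illustration; they are not part of words.py.

```python
import re

def word_list(text):
    """Return the sorted, de-duplicated lowercase words found in text."""
    # Keep runs of letters, allowing internal apostrophes and hyphens
    words = re.findall(r"[a-z]+(?:['-][a-z]+)*", text.lower())
    return sorted(set(words))

def word_list_from_file(path, encoding="utf-8"):
    """Read a whole plain-text file and return its word list (hypothetical helper)."""
    with open(path, encoding=encoding) as handle:
        return word_list(handle.read())

if __name__ == "__main__":
    sample = "riverrun, past Eve and Adam's, from swerve of shore to bend of bay"
    for term in word_list(sample):
        print(term)
```

With this pattern, "Adam's," becomes "adam's" and "bay," becomes "bay". Only the unaccented letters a-z are recognized, so text in other alphabets would need a different pattern.
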
Here is a nearly equivalent Perl script, words.pl, that creates a word list from a file. In this case, the chosen file happens to be "gettysbu.txt", containing the full text of the Gettysburg Address. We could have included the name of any plain-text file.
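#!/usr/local/bin/perl
# words.pl: build a sorted list of unique words from a text file
open(TEXT, "gettysbu.txt") or die "Cannot open gettysbu.txt: $!";
undef($/);                      # slurp mode: read the whole file at once
$var = lc(<TEXT>);
close(TEXT);
$var =~ s/\n/ /g;               # newlines become spaces
$var =~ s/\'s//g;               # drop possessive 's
$var =~ tr/a-zA-Z\'\- //cd;     # keep only letters, apostrophes, hyphens, spaces
@words = sort(split(/ +/, $var));
# Remove adjacent duplicates from the sorted list
@words = grep($_ ne $prev && (($prev) = $_), @words);
print (join("\n",@words));
exit;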
The words.pl script was designed for speed. You'll notice that it slurps the entire contents of a file into a string variable. If we were dealing with a very large file that exceeded the available RAM of our computer, we would need to modify the script to parse through the file line-by-line.
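
A line-by-line version might look like the following sketch, written in Python rather than Perl. It holds only the current line and the set of distinct words in memory. The names WORD_PATTERN, word_list_from_lines, and word_list_streaming are hypothetical, and unlike words.pl this sketch keeps possessives such as "adam's" intact.

```python
import re

# Runs of letters, allowing internal apostrophes and hyphens (an assumption)
WORD_PATTERN = re.compile(r"[a-z]+(?:['-][a-z]+)*")

def word_list_from_lines(lines):
    """Build a sorted word list from any iterable of text lines."""
    seen = set()
    for line in lines:
        # Only the current line and the set of distinct words are held in memory
        seen.update(WORD_PATTERN.findall(line.lower()))
    return sorted(seen)

def word_list_streaming(path, encoding="utf-8"):
    """Stream a plain-text file line by line instead of slurping it."""
    with open(path, encoding=encoding) as handle:
        return word_list_from_lines(handle)

if __name__ == "__main__":
    import io
    text = io.StringIO("Four score and seven years ago\nour fathers brought forth\n")
    print(word_list_from_lines(text))
```

One caveat: a word that is hyphenated across a line break will be split into two terms, because each line is parsed on its own.
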

Aside from word lists you create for yourself, there is a wide variety of specialized knowledge domain nomenclatures available to the public [1], [2], [3], [4], [5], [6]. Linux distributions often bundle a word list, under filename "words", that is useful for parsing and natural language processing applications. A copy of the Linux word list is available at:

http://www.cs.duke.edu/~ola/ap/linuxwords

Curated lists of terms, either generalized or restricted to a specific knowledge domain, are indispensable for a variety of applications (e.g., spell-checkers, natural language processors, machine translation, coding by term, indexing). Personally, I have spent an inexcusable amount of time creating my own lists, when no equivalent public domain resource was available.
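
As one illustration of how a curated list gets used, here is a minimal sketch of the lookup at the heart of a spell-checker: report every word in a text that is absent from a reference word list. The function names load_word_list and unknown_words are hypothetical, and the tiny vocabulary in the example is made up for the demonstration.

```python
import re

def load_word_list(lines):
    """Return a set of lowercase terms from an iterable with one term per line."""
    return {line.strip().lower() for line in lines if line.strip()}

def unknown_words(text, vocabulary):
    """Return the sorted words in text that are absent from vocabulary."""
    words = set(re.findall(r"[a-z]+(?:['-][a-z]+)*", text.lower()))
    return sorted(words - vocabulary)

if __name__ == "__main__":
    # A made-up vocabulary; a real one would be read from a word list file
    vocabulary = load_word_list(["bend", "of", "bay", "shore", "to"])
    print(unknown_words("swerve of shore to bend of bay", vocabulary))
```

To use a bundled Linux word list, you could pass an open file handle to load_word_list. The file is commonly found at /usr/share/dict/words, but that location is an assumption and varies by distribution.
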

- Jules Berman

key words: computer science, data analysis, data repurposing, data simplification, data wrangling, information science, simplifying data, taming data, complexity, system calls, Perl, Python, open source tools, utility, word lists, jules j berman

References:

[1] Medical Subject Headings. U.S. National Library of Medicine. Available at: https://www.nlm.nih.gov/mesh/filelist.html, viewed on July 29, 2015.

[2] Berman JJ. A Tool for Sharing Annotated Research Data: the "Category 0" UMLS (Unified Medical Language System) Vocabularies. BMC Med Inform Decis Mak, 3:6, 2003.

[3] Berman JJ. Tumor taxonomy for the developmental lineage classification of neoplasms. BMC Cancer 4:88, 2004. Available at: http://www.biomedcentral.com/1471-2407/4/88, viewed on Jan. 1, 2015.

[4] Hayes CF, O'Connor JC. English-Esperanto Dictionary. Review of Reviews Office, London, 1906. Available at: http://www.gutenberg.org/ebooks/16967, viewed on July 29, 2015.

[5] Sioutos N, de Coronado S, Haber MW, Hartel FW, Shaiu WL, Wright LW. NCI Thesaurus: a semantic model integrating cancer-related clinical and molecular information. J Biomed Inform 40:30-43, 2007.

[6] NCI Thesaurus. National Cancer Institute, U.S. National Institutes of Health, Bethesda, MD. Available at: ftp://ftp1.nci.nih.gov/pub/cacore/EVS/NCI_Thesaurus/ viewed on July 29, 2015.
