Over the next few weeks, I will be writing on topics related to my latest book, Data Simplification: Taming Information With Open Source Tools (release date March 23, 2016). I hope I can convince you that this is a book worth reading.
Word lists, for just about any written language for which there is an electronic literature, are easy to create. Here is a short Python script, words.py, that prompts the user to enter a line of text. The script drops the line to lowercase, removes the carriage return at the end of the line, parses the result into an alphabetized list, removes duplicate terms from the list, and prints out the list, with one term assigned to each line of output. This words.py script can be easily modified to create word lists from plain-text files (See Glossary item, Metasyntactic variable).
#!/usr/local/bin/python
import sys, re, string
print "Enter a line of text to be parsed into a word list"
line = sys.stdin.readline()
line = string.lower(line)                       # drop the line to lowercase
line = string.rstrip(line)                      # remove the trailing carriage return
linearray = sorted(set(re.split(r' +', line)))  # split on spaces, remove duplicates, alphabetize
for i in range(0, len(linearray)):
    print(linearray[i])
exit

Here is a sample of the output, when the input is the first line of Joyce's Finnegans Wake:
c:\ftp>words.py
Enter a line of text to be parsed into a word list
a way a lone a last a loved a long the riverrun, past Eve and Adam's, from swerve of shore to bend of bay, brings us by a commodius vicus
a
adam's,
and
bay,
bend
brings
by
commodius
eve
from
last
lone
long
loved
of
past
riverrun,
shore
swerve
the
to
us
vicus
way

Here is a nearly equivalent Perl script, words.pl, that creates a word list from a file. In this case, the chosen file happens to be "gettysbu.txt", containing the full text of the Gettysburg Address. We could have included the name of any plain-text file.
#!/usr/local/bin/perl
open(TEXT, "gettysbu.txt");                 # any plain-text file will do
undef($/);                                  # undefine the record separator, so the whole file is read at once
$var = lc(<TEXT>);                          # slurp the file into a string, dropped to lowercase
$var =~ s/\n/ /g;                           # replace newlines with spaces
$var =~ s/\'s//g;                           # remove possessives
$var =~ tr/a-zA-Z\'\- //cd;                 # delete everything except letters, apostrophes, hyphens, and spaces
@words = sort(split(/ +/, $var));           # split into an alphabetized array of words
@words = grep($_ ne $prev && (($prev) = $_), @words);   # remove duplicates from the sorted array
print (join("\n",@words));
exit;

The words.pl script was designed for speed. You'll notice that it slurps the entire contents of a file into a string variable. If we were dealing with a very large file that exceeded the functional RAM memory limits of our computer, we would need to modify the script to parse through the file line-by-line.
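If you prefer to stay in Python, a line-by-line version of the same idea might look something like the sketch below. This is a minimal sketch of my own, not taken from the book; it assumes the same example file, "gettysbu.txt", and reads the file one line at a time so that a very large file never needs to fit into memory all at once.

#!/usr/local/bin/python
# Minimal sketch: build a word list from a plain-text file, one line at a time.
import re

words = set()
with open("gettysbu.txt") as text:               # any plain-text file will do
    for line in text:
        line = line.lower()                      # drop the line to lowercase
        line = re.sub(r"'s", "", line)           # remove possessives, as in words.pl
        line = re.sub(r"[^a-z'\- ]", " ", line)  # keep only letters, apostrophes, hyphens, spaces
        words.update(re.split(r" +", line.strip()))
words.discard("")                                # drop the empty string produced by blank lines
for word in sorted(words):
    print(word)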
Aside from the word lists you create for yourself, a wide variety of specialized knowledge-domain nomenclatures are available to the public [1-6]. Linux distributions often bundle a word list, under the filename "words", that is useful for parsing and natural language processing applications. A copy of the Linux word list is available at:
http://www.cs.duke.edu/~ola/ap/linuxwords
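As a quick example of how such a list might be put to use, the sketch below (mine, not from the book) assumes you have saved a local copy of the list as "linuxwords". It loads the list into a Python set and prints any terms in "gettysbu.txt" that do not appear in the list, which is the germ of a crude spell-checker.

#!/usr/local/bin/python
# Minimal sketch: flag terms in a text file that are absent from a word list.
# Assumes local copies named "linuxwords" (the word list) and "gettysbu.txt".
import re

with open("linuxwords") as f:
    wordlist = set(line.strip().lower() for line in f)

with open("gettysbu.txt") as f:
    text = f.read().lower()
terms = set(re.findall(r"[a-z]+(?:'[a-z]+)?", text))  # words, optionally with an internal apostrophe

unknown = sorted(terms - wordlist)                    # terms not found in the word list
print("\n".join(unknown))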
Curated lists of terms, either generalized or restricted to a specific knowledge domain, are indispensable for a variety of applications (e.g., spell-checkers, natural language processors, machine translation, coding by term, and indexing). Personally, I have spent an inexcusable amount of time creating my own lists when no equivalent public domain resource was available.
- Jules Berman
key words: computer science, data analysis, data repurposing, data simplification, data wrangling, information science, simplifying data, taming data, complexity, system calls, Perl, Python, open source tools, utility, word lists, jules j berman
References:
[1] Medical Subject Headings. U.S. National Library of Medicine. Available at: https://www.nlm.nih.gov/mesh/filelist.html, viewed on July 29, 2015.
[2] Berman JJ. A Tool for Sharing Annotated Research Data: the "Category 0" UMLS (Unified Medical Language System) Vocabularies. BMC Med Inform Decis Mak, 3:6, 2003.
[3] Berman JJ. Tumor taxonomy for the developmental lineage classification of neoplasms. BMC Cancer 4:88, 2004. Available at: http://www.biomedcentral.com/1471-2407/4/88, viewed on January 1, 2015.
[4] Hayes CF, O'Connor JC. English-Esperanto Dictionary. Review of Reviews Office, London, 1906. Available at: http://www.gutenberg.org/ebooks/16967, viewed on July 29, 2015.
[5] Sioutos N, de Coronado S, Haber MW, Hartel FW, Shaiu WL, Wright LW. NCI Thesaurus: a semantic model integrating cancer-related clinical and molecular information. J Biomed Inform 40:30-43, 2007.
[6] NCI Thesaurus. National Cancer Institute, U.S. National Institutes of Health, Bethesda, MD. Available at: ftp://ftp1.nci.nih.gov/pub/cacore/EVS/NCI_Thesaurus/ viewed on July 29, 2015.