Saturday, March 15, 2008

RDF Schemas and directed graphs

In yesterday's blog, I provided a Perl script that transformed an RDF Schema into a schematic of the Neoplasm Classification.

Today, I want to go back and review the relationship between RDF Schemas and directed graphs, to explain why this transformation is possible, and to provide some tools you might need to do this for your own RDF Schemas.

RDF (Resource Description Framework) is a syntax and logic for making computer parsable statements that have meaning. RDF uses the familiar tagging syntax of XML, so all RDF documents are also XML documents.

Meaning, in the informatics field, is achieved when a datum and a description of the datum (a metadatum) are bound to an identified object. This is the "triple" that underlies all of RDF.

Here is a data/metadata pair:

<date>March 15, 2008</date>

This kind of data/metadata pair, common to XML, has no meaning because it is not bound to an identified object.

Now consider:

Ides of March occurs on <date>March 15, 2008</date>

This gets a little closer to an RDF statement (with meaning) because the data/metadata pair are assigned to an object (Ides of March).
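To make the idea concrete, here is a minimal sketch, in Python rather than RDF syntax, of a triple as an ordered (subject, predicate, object) tuple. The identifiers are invented for illustration and are not part of any real RDF vocabulary.

```python
# A triple binds an identified object (subject), a metadata descriptor
# (predicate), and a datum (object). Illustrative names only; real RDF
# toolkits use richer node types (URIs, literals).
triple = ("Ides_of_March", "date", "March 15, 2008")

subject, predicate, obj = triple
statement = f'{subject} has {predicate} "{obj}"'
print(statement)
```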

RDF has a formal way of defining objects [and their properties, which we won't discuss here]. This is called RDF Schema. You can think of RDF Schema as a dictionary for the terms in an RDF data document. RDF Schema is written in RDF syntax. This means that all RDF Schemas are RDF documents and consist of statements in the form of triples.

For today, the important point about RDF Schemas is that they create logical relationships among objects in a domain that can be translated into directed graphs (graphs consisting of nodes connected by arcs, with a direction assigned to each arc).

An example of two RDF statements:

<rdfs:Class rdf:ID="Fibrous_tissue">
<rdfs:subClassOf
neo:resource="#Connective_tissue"/>
</rdfs:Class>

<rdfs:Class rdf:ID="Mesoderm_primitive">
<rdfs:subClassOf
neo:resource="#Mesoderm"/>
</rdfs:Class>


Every class of object is a subclass of another class of object. A Perl script, such as the one that I provided yesterday, can parse an RDF Schema and transform it into a GraphViz script. This is a type of poor-man's metaprogramming (using a programming language to generate programs in another programming language). GraphViz is a free, open source graph visualization tool that renders a wide range of graphic representations for specified object relationships.
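The transformation can be sketched in a few lines. The sketch below (Python standing in for the Perl script, with the two example rdfs:Class statements inlined) extracts each child/parent pair and emits the corresponding GraphViz arcs. It is an illustration of the idea, not the script from yesterday's blog.

```python
import re

# Two rdfs:Class statements, copied from the example above; a real run
# would read the full schema from a file.
schema = '''
<rdfs:Class rdf:ID="Fibrous_tissue">
<rdfs:subClassOf neo:resource="#Connective_tissue"/>
</rdfs:Class>
<rdfs:Class rdf:ID="Mesoderm_primitive">
<rdfs:subClassOf neo:resource="#Mesoderm"/>
</rdfs:Class>
'''

dot_lines = ["digraph G {"]
# Each rdfs:Class block yields one parent -> child arc in the digraph.
for child, parent in re.findall(
        r'rdf:ID="(\w+)">\s*<rdfs:subClassOf\s+\w+:resource="#(\w+)"',
        schema):
    dot_lines.append(f"{parent} -> {child};")
dot_lines.append("}")
dot_script = "\n".join(dot_lines)
print(dot_script)
```

Feeding the resulting text to GraphViz's dot command renders the digraph.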

Information on GraphViz is available at:

http://www.graphviz.org/

RDF is the syntax and logic underlying the semantic web, and every serious informatician must learn to use RDF. There are quite a few books and articles written on RDF. My book, Ruby Programming for Medicine and Biology, has a large section on RDF with some examples showing how to build and use RDF Schemas and RDF documents. In my opinion, Ruby is a better language than Perl or Python for dealing with RDF logic. Also, based on my limited ability to survey the literature, it would seem that a really good book, one that explains RDF and provides good examples for building RDF documents and drawing useful inferences from multiple RDF documents, has not yet been written. When I find one, I'll let you know.

- Jules Berman

key words: RDF Schema, triple, triples, ontology, digraph

Friday, March 14, 2008

Creating a directed graph from an RDF schema

In yesterday's blog, I announced the newest version of the Neoplasm Classification. The Neoplasm Classification is an open source document available as a plain-text flat-file, as an XML file, and as an RDF document. The top of the RDF document contains the complete RDF Schema for the Classification. The remainder of the file (>99% of the file) is devoted to the entries for the individual neoplasm terms (over 135,000 of them).

I have a small schematic that represents the organization of the Neoplasm Classification.



In addition to this small schematic, I have a large schematic that represents the complete hierarchy of the Classification. Though it is too large to fit in this blog, each part of the schematic is quite simple.

It took a few seconds to create the complete diagram of the classification, in digraph (directed graph) form, using a short Perl script that parsed the Classification's RDF Schema, plus one command-line instruction to the free and open source GraphViz application.

Here's the script. As per usual, the following statement applies. The software is provided "as is", without warranty of any kind, express or implied, including but not limited to the warranties of merchantability, fitness for a particular purpose and noninfringement. In no event shall the authors or copyright holders be liable for any claim, damages or other liability, whether in an action of contract, tort or otherwise, arising from, out of or in connection with the software or the use or other dealings in the software.

#!/usr/bin/perl
# Parse an RDF Schema and emit a GraphViz (DOT) script of its class hierarchy.
open (TEXT, "schema.txt");
open (OUT, ">schema.dot");
$/ = "</rdfs:Class>";   # read one rdfs:Class record at a time
print OUT "digraph G {\n";
print OUT "size=\"15,15\";\n";
print OUT "ranksep=\"2.00\";\n";
$line = " ";
while ($line ne "")
{
$line = <TEXT>;
last if ($line !~ /<rdfs:/);
# the parent class, named in the subClassOf resource attribute
if ($line =~
/:resource="[a-z0-9:\/_.\-]*#([a-z_]+)"/i)
{
$father = $1;
}
# the child class, named in the rdf:ID attribute
if ($line =~ /rdf:ID="([a-z_]+)"/i)
{
$child = $1;
}
print OUT "$father -> $child;\n";
print "$father -> $child;\n";
}
print OUT "}";
exit;

If you work with RDF (and every biomedical professional should understand how RDF is used to specify data), you will want a method that can instantaneously render a schematic of your RDF Schema (ontology) or of any descendant section of your Schema.

Tomorrow, I'll go into some detail to describe just how the Perl script produces a GraphViz script that can render an RDF Schema as a visual digraph.

- Jules Berman


My book, Principles of Big Data: Preparing, Sharing, and Analyzing Complex Information, was published in 2013 by Morgan Kaufmann.



I urge you to read more about my book. Google books has prepared a generous preview of the book contents. If you like the book, please request your librarian to purchase a copy of this book for your library or reading room.
tags: big data, metadata, data preparation, data analytics, data repurposing, datamining, data mining, medical autocoding, medical data scrubbing, medical data scrubber, medical record scrubbing, medical record scrubber, medical text parsing, medical autocoder, nomenclature, terminology, ontology, GraphViz, Neoplasm Classification, semantic web, ontologies, digraph, directed graph, tumor, cancer, tumour, neoplasia

Thursday, March 13, 2008

Updated files for the Neoplasm Classification now available

Updated versions of the Neoplasm Classification are now available:

The Neoplasm Classification contains over 135,000 classified names of neoplasms in a biological hierarchy based on developmental lineage of the tumor. It is the largest and most comprehensive neoplasm nomenclature in existence. It is available as a simple XML file, an RDF ontology, or a plain flat-file.

These files were prepared by Jules J. Berman. The first version of this file was created November 15, 2003. The modifications were created on March 13, 2008.

The following applies to the distributed documents:

Copyright (c) 2007-2008 Jules J. Berman. Permission is granted to copy, distribute and/or modify this document under the terms of the GNU Free Documentation License, Version 1.2 or any later version published by the Free Software Foundation; with no Invariant Sections, no Front-Cover Texts, and no Back-Cover Texts. A copy of the license is available at:
http://en.wikipedia.org/wiki/Wikipedia:Text_of_the_GNU_Free_Documentation_License

The files are provided "as is", without warranty of any kind, expressed or implied, including but not limited to the warranties of merchantability, fitness for a particular purpose and noninfringement. In no event shall the author or copyright holder be liable for any claim, damages or other liability, whether in an action of contract, tort or otherwise, arising from, out of or in connection with the software or the use or other dealings in the software.

An explanation of the classification can be found in the following two publications, which should be cited in any publication or work that may result from any use of this file.

Berman JJ. Tumor classification: molecular analysis meets Aristotle.
BMC Cancer 4:8, 2004.

Berman JJ. Biomedical Informatics . Jones and Bartlett Publishers,
Sudbury, MA, 2007.

In the Neoplasm Classification, all classified names of neoplasms are coded with a "C" followed by a 7 digit number other than 0000000 or 0000001.

For example, "C9168000" = rectal signet ring adenocarcinoma

In addition to the classified terms, there are four groups of unclassified terms, provided as special items that follow the list of classified terms in this file.

"C0000000"
"C0000001"
"S" followed by 7 digits
"ST" followed by 7 digits

The list of unclassified terms coded as "C0000000" consists of general cancer terms that do not specify any particular neoplasm; overly specific terms that provide so-called pre-coordinated annotations related to terms contained elsewhere in the Classification; and valid terms that have not (yet) been added to the list of classified neoplasm terms.

Examples of non-specific cancer-related terms are:

borderline tumor
mucinous tumor
blast crisis
preinvasive carcinoma
dysplasia

Examples of overly specific terms are:

squamous carcinoma of the nasal vestibule
gastric non-hodgkin lymphoma of mucosa-associated lymphoid tissue
primary primitive neuroectodermal tumor of the kidney

The terms that are coded with "C0000001" are precancers and related conditions that have not yet been added to the list of classified terms.

The terms that are coded "S" followed by 7 digits are inherited syndromes that have a neoplastic component (i.e., the occasional or frequent appearance of neoplasms in the syndrome).

The terms that are coded "ST" followed by 7 digits are staging terms used by oncologists.

The classification is intended for informatics projects that use computer parsing techniques. Programmers should simply insert statements that filter out the unclassified terms included in the file.
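A filtering statement of the kind suggested above might look like this sketch (Python; the sample records are invented, and only the code pattern, "C" plus 7 digits with the two reserved codes excluded, comes from the text):

```python
import re

# Invented sample records pairing a code with a term; a real run would
# read these from the Classification file.
records = [
    ("C9168000", "rectal signet ring adenocarcinoma"),
    ("C0000000", "borderline tumor"),
    ("C0000001", "a precancer term"),
    ("S0000123", "an inherited syndrome term"),
    ("ST0000042", "a staging term"),
]

def is_classified(code):
    # True only for a classified code: "C" followed by 7 digits,
    # other than the reserved codes C0000000 and C0000001.
    return bool(re.fullmatch(r"C\d{7}", code)) and \
        code not in ("C0000000", "C0000001")

classified = [(c, t) for c, t in records if is_classified(c)]
print(classified)
```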

Additional information may be available from the author's web site:
http://www.julesberman.info/

The gzipped version of the RDF file (under 1 Megabyte)
http://www.julesberman.info/neorxml.gz

The flat file version, listing each term followed by its lineage (gzipped file).
http://www.julesberman.info/neoself.gz

The plain old XML version, with no RDF semantics (gzipped file).
http://www.julesberman.info/neoclxml.gz

- Jules Berman

In June, 2014, my book, entitled Rare Diseases and Orphan Drugs: Keys to Understanding and Treating the Common Diseases was published by Elsevier. The book builds the argument that our best chance of curing the common diseases will come from studying and curing the rare diseases.



I urge you to read more about my book. There's a generous preview of the book at the Google Books site. If you like the book, please request your librarian to purchase a copy of this book for your library or reading room.

tags: common disease, orphan disease, orphan drugs, rare disease, subsets of disease, disease genetics, medical nomenclature, ontology, classification, rdf, resource description framework, xml, semantic web

Tuesday, March 11, 2008

MISFISHIE Specification (for in situ hybridization and immunohistochemistry) now available

Under the leadership of Eric Deutsch, a minimum information specification for in situ hybridization and immunohistochemistry experiments (MISFISHIE) has just been published in Nature Biotechnology.

Deutsch EW, Ball CA, Berman JJ, Bova GS, Brazma A, Bumgarner RE, Campbell D, Causton HC, Christiansen JH, Daian F, Dauga D, Davidson DR, Gimenez G, Goo YA, Grimmond S, Henrich T, Herrmann BG, Johnson MH, Korb M, Mills JC, Oudes AJ, Parkinson HE, Pascal LE, Pollet N, Quackenbush J, Ramialison M, Ringwald M, Salgado D, Sansone SA, Sherlock G, Stoeckert CJ Jr, Swedlow J, Taylor RC, Walashek L, Warford A, Wilkinson DG, Zhou Y, Zon LI, Liu AY, True LD. Minimum information specification for in situ hybridization and immunohistochemistry experiments (MISFISHIE). Nat Biotechnol. 2008 Mar;26(3):305-12.

MISFISHIE is modelled after the MIAME (Minimum Information About a Microarray Experiment) specification for microarray experiments.

It has been a constant theme in this blog that data specifications are, in many instances, much better than data standards. Data specifications, like MIAME and MISFISHIE, specify the information content without dictating a format for encoding that information.

Nature Biotechnology put up the entire MISFISHIE specification for public comment, and it is currently available at:

http://www.nature.com/nbt/consult/pdf/Deutsch.pdf


- Jules Berman

tags: common disease, orphan disease, orphan drugs, genetics of disease, disease genetics, rules of disease biology, rare disease, pathology,annotation, data sharing, data specifications

Monday, March 10, 2008

Gaucher cells occurring in diseases other than Gaucher disease

From time to time, I upload some of my old publications to my web site. Today, I uploaded:

Berman JJ, Iseri OA. Acquired Gaucher Cells Located in Dermis near a Malignant Hidradenoma. Ultrastructural Pathology 12:245-246, 1988.



FIG. 1 (from paper) Dermal Pacinian corpuscle surrounded by Gaucher cell histiocytes.

- Jules Berman

tags: common disease, orphan disease, orphan drugs, genetics of disease, disease genetics, rules of disease biology, rare disease, pathology

Tuesday, March 4, 2008

Medical Linguistics, Part 5

The past few blogs have been a series devoted to Medical Linguistics. Yesterday's blog discussed the fundamental linguistic principles underlying the doublet method.

In today's blog, I'm posting a Perl script that extracts, from a large nomenclature, the terms that cannot be composed from doublets contained in other terms (i.e., the terms that must include unique doublets).

The key lines in the script (the ones that create the list of doublets) are shown here:

foreach $thing (@words)
{
$doublet = "$oldthing $thing";
if ($doublet =~ /^[a-z]+ [a-z]+$/)
{
$doublethash{$doublet} =
$doublethash{$doublet} + 1;
}
$oldthing = $thing;
}


These lines are used in virtually every Perl script that uses the doublet method. Basically, they move through an array of the consecutive words in a nomenclature term, one word at a time, forming each overlapping two-word doublet and adding it to a doublet hash with each pass through the loop. If you know Perl, this little piece of code should be easy to understand.
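For comparison, the same loop can be sketched in Python; the two sample terms are invented, and a real run would read the terms from the nomenclature file:

```python
# Invented sample terms standing in for the nomenclature.
terms = [
    "rectal signet ring adenocarcinoma",
    "gastric signet ring adenocarcinoma",
]

doublet_counts = {}
for term in terms:
    words = term.split()
    # Consecutive, overlapping word pairs: each interior word ends one
    # doublet and starts the next.
    for first, second in zip(words, words[1:]):
        doublet = f"{first} {second}"
        doublet_counts[doublet] = doublet_counts.get(doublet, 0) + 1

print(doublet_counts)
```

A doublet with a count of 1 occurs in only one term, which is exactly the condition the full script tests for.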

The entire Perl script follows here. As with all my posted scripts, the software is provided "as is", without warranty of any kind, express or implied, including but not limited to the warranties of merchantability, fitness for a particular purpose and noninfringement. In no event shall the authors or copyright holders be liable for any claim, damages or other liability, whether in an action of contract, tort or otherwise, arising from, out of or in connection with the software or the use or other dealings in the software.

#!/usr/local/bin/perl
# Find nomenclature terms that contain a doublet found in no other term.
open(TEXT,"neocl.xml")||die"cannot";
open(OUT,">dubuniq.txt")||die"cannot";
# First pass: count every word doublet in the nomenclature.
$line = " ";
while ($line ne "")
{
$line = <TEXT>;
next if ($line !~ /\"(C[0-9]{7})\"/);
next if ($line !~ /\"\> ?(.+) ?\<\//);
$line =~ /\"\> ?(.+) ?\<\//;
$phrase = $1;
@words = split(/ /, $phrase);
$oldthing = "";   # reset so a doublet never spans two terms
foreach $thing (@words)
{
$doublet = "$oldthing $thing";
if ($doublet =~ /^[a-z]+ [a-z]+$/)
{
$doublethash{$doublet} =
$doublethash{$doublet} + 1;
}
$oldthing = $thing;
}
}
close TEXT;
# Second pass: flag terms containing a doublet that occurred only once.
open(TEXT,"neocl.xml")||die"cannot";
$phrase = "";
$line = " ";
$count = 0;
while ($line ne "")
{
$oldthing = "";
$rightflank = "";
$leftflank = "";
$line = <TEXT>;
next if ($line !~ /\"(C[0-9]{7})\"/);
next if ($line =~ /\"C0000000\"/);
next if ($line =~ /\"C0000001\"/);
next if ($line !~ /\"\> ?(.+) ?\<\//);
$line =~ /\"\> ?(.+) ?\<\//;
$phrase = $1;
@words = split(/ /, $phrase);
next if (scalar(@words) < 3);
foreach $thing (@words)
{
$newdoublet = "$oldthing $thing";
if ($newdoublet =~ /^[a-z]+ [a-z]+$/)
{
if (exists($doublethash{$newdoublet}))
{
if ($doublethash{$newdoublet} == 1)
{
if ($phrase =~ /[a-z]+ $oldthing/)
{
$leftflank = $&;
}
if ($phrase =~ /$thing [a-z]+/)
{
$rightflank = $&;
}
unless ($doublethash{$leftflank} > 1
&& $doublethash{$rightflank} > 1)
{
$uniqphrase{$phrase} = "";
}
}
}
}
$oldthing = $thing;
}
}

while (my ($key, $value) = each(%uniqphrase))
{
$count++;
print OUT "$count $key\n";
}
exit;

The output is a file listing, one term per line, the terms that cannot be constructed from doublets found in other terms. This output was discussed in a prior blog. A sample list of doublets is available for download.

- Jules Berman


tags: big data, metadata, data preparation, data analytics, data repurposing, datamining, data mining, doublet method, medical linguistics, medical algorithm, nomenclature

Monday, March 3, 2008

Medical Linguistics, Part 4

In the past few blogs, I have been covering some special linguistic aspects of medical terminologies.

Let's summarize:

1. In a large medical nomenclature, singlets (single-word terms) are infrequent. In our example terminology, the Neoplasm Classification, there are about 500 singlets in a classified nomenclature that contains more than 130,000 terms! By the way, the Neoplasm Classification is available for download as a gzipped XML file.

2. All multi-word terms are composed of doublets (two-word terms), and doublets have a more specific meaning than do singlets.

3. Most multi-word terms in medical nomenclatures are composed of doublets that are found in other terms from the same nomenclature. In the Neoplasm Classification (exceeding 130,000 different terms), there are fewer than 300 terms that cannot be composed of doublets found in other terms.

What do these empirical observations imply?

1. If you parse through any medical text, and you encounter a sequence of words composed of doublets [that are found in a nomenclature], the sequence of words is likely to contain terms from the nomenclature.

2. Conversely, if you parse through any medical text, and you encounter a sequence of words composed of doublets [that are NOT found in a nomenclature], the sequence of words is likely NOT to contain terms from the nomenclature.

3. If you parse through any medical text, and you encounter a sequence of words composed of doublets [that are found in a nomenclature], and the sequence of words does not contain terms from the nomenclature, then the sequence of words may contain one or more new terms that can be added to the nomenclature.
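Implication 1 can be sketched as a simple membership test: if every consecutive doublet of a phrase appears in the nomenclature's doublet set, the phrase is a candidate nomenclature term. The doublet set below is a tiny invented stand-in for the full nomenclature.

```python
# Invented stand-in for the set of all doublets found in the nomenclature.
known_doublets = {
    "rectal signet", "signet ring", "ring adenocarcinoma",
}

def doublets(phrase):
    # All consecutive, overlapping two-word pairs of the phrase.
    words = phrase.lower().split()
    return [f"{a} {b}" for a, b in zip(words, words[1:])]

def is_candidate_term(phrase):
    # A candidate term is a phrase whose every doublet is a known doublet.
    ds = doublets(phrase)
    return bool(ds) and all(d in known_doublets for d in ds)

print(is_candidate_term("rectal signet ring adenocarcinoma"))  # True
print(is_candidate_term("funny signet ring story"))            # False
```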

In the next blogs, we will explore how to use these ideas to design software that can:

1. Automatically extract terms from a medical corpus (large text file)

2. Automatically code the extracted terms that match existing terms in the nomenclature

3. Automatically remove extraneous words from medical records that may contain patient identifiers or private information related to patients

4. Identify new candidate terms that may need to be added to the nomenclature

If you understand this blog, and if you have a little programming skill, you can write simple, fast, medical software that can perform many of the common computational tasks encountered by biomedical informaticians.

As per usual, most of the topics explored in this blog have been discussed in my book, Biomedical Informatics. Programming skills for biomedical professionals are taught in my books, Perl Programming for Medicine and Biology and Ruby Programming for Medicine and Biology.

- Jules Berman


tags: big data, metadata, data preparation, data analytics, data repurposing, datamining, data mining, medical autocoding, medical data scrubbing, medical data scrubber, medical record scrubbing, medical record scrubber, medical text parsing, medical autocoder, nomenclature, terminology, biomedical informatics, doublet method, medical terminology, medical autocoding, medical autocoder, medical record de-identification, medical record deidentification, medical informaticist, biomedical informaticist, medical informatics, biomedical algorithms, medical algorithms

Sunday, March 2, 2008

Medical Linguistics, Part 3

In yesterday's blog, I wrote that medical terms are composed of doublets, each of which conveys a very specific meaning. Individual words seldom have a single meaning.

Most terms in a medical nomenclature are composed of doublets found elsewhere in the terminology. In other words, unique terms are composed of common doublets, with very few exceptions.

The Neoplasm Classification contains over 130,000 names of neoplasms. Among this large number of terms, there are about 1,500 terms that contain a doublet that is uniquely found in the term (i.e., not found in one or more additional terms in the nomenclature). This represents about 1% of the total number of terms in the nomenclature. (The entire Neoplasm Classification is available as a gzipped file from my web site.)

The Perl script that produces the list of terms that cannot be constructed from doublets found in other terms is discussed in a later blog.

- Jules Berman

key words: doublet method, neoplasm classification, nomenclature,

Saturday, March 1, 2008

Ruby Programming for Medicine and Biology limited preview at Google Books

I just noticed that the Google Book site now features a limited preview of my book, Ruby Programming for Medicine and Biology.

If you're curious, the Google book site is nearly as good as skimming through a copy of the physical book.

- Jules Berman

key words: biomedical informatics, book review

Medical linguistics, Part 2

This is a continuation of yesterday's blog.

One of the many challenges in the field of machine translation is that expressions (multi-word terms) convey ideas that transcend the meanings of the individual words in the expression. Consider the following sentence:

"The ciliary body produces aqueous humor."

The example sentence has an unambiguous meaning to anatomists, but each word in the sentence can have many different meanings. "Ciliary" is a common medical word, and usually refers to the action of cilia. Cilia are found throughout the respiratory and GI tracts and play an important role in moving particulate matter. The word "body" almost always refers to the human body. The term "ciliary body" should (but does not) refer to the action of cilia that move human bodies from place to place. The word "aqueous" always refers to water. "Humor" usually relates to something being funny. The term "aqueous humor" should (but does not) relate to something that is funny by virtue of its use of water (as in squirting someone in the face with a trick flower). Actually, "ciliary body" and "aqueous humor" are each examples of medical doublets whose meanings are specific and contextually constant (i.e., they always mean one thing). Furthermore, the meanings of the doublets cannot be reliably determined from the individual words that constitute the doublet, because the individual words have several different meanings. Basically, you either know the correct meaning of the doublet, or you don't.

Any sentence can be examined by parsing it into an array of intercalated doublets:

"The ciliary, ciliary body, body produces, produces aqueous, aqueous humor."

The important concepts in the sentence are contained in two doublets (ciliary body and aqueous humor). A nomenclature containing these doublets would allow us to extract and index these two medical concepts. A nomenclature consisting of single words might miss the contextual meaning of the doublets.
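The intercalated-doublet parse shown above can be sketched as:

```python
def intercalated_doublets(sentence):
    # Strip the trailing period, lowercase, and form every overlapping
    # two-word pair of the sentence.
    words = sentence.strip(".").lower().split()
    return [f"{a} {b}" for a, b in zip(words, words[1:])]

pairs = intercalated_doublets("The ciliary body produces aqueous humor.")
print(pairs)
```

Checking each pair against a doublet nomenclature would flag "ciliary body" and "aqueous humor" as the indexable concepts.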

What if the term were larger than a doublet? Consider the tumor "orbital alveolar rhabdomyosarcoma." The individual words can be misleading. This orbital tumor is not from outer space, and the alveolar tumor is not from the lung. The 3-word term describes a sarcoma arising from the orbit of the eye that has a morphology characterized by tiny spaces of a size and shape as may occur in glands (alveoli). The term "orbital alveolar rhabdomyosarcoma" can be parsed as "orbital alveolar, alveolar rhabdomyosarcoma". Why is this any better than parsing the term into individual words, as in "orbital, alveolar, rhabdomyosarcoma"? The doublets, unlike the single words, are highly specific terms that are unlikely to occur in association with more than a few specific concepts.

Very few medical terms are single words. In the Neoplasm Classification, there are over 135,000 terms, and only about 500 are single words. The doublet method uses the multi-word feature of medical terms to extract meaning from text.

This topic is covered in detail in my book, Biomedical Informatics.
To be continued.

- Jules Berman



tags: big data, metadata, data preparation, data analytics, data repurposing, datamining, data mining, medical autocoding, medical data scrubbing, medical data scrubber, medical record scrubbing, medical record scrubber, medical text parsing, medical autocoder, nomenclature, terminology