Friday, April 6, 2007

Tissue Microarray Data Exchange Specification

The Tissue Microarray Data Exchange Specification (TMA Specification) was developed by the Association for Pathology Informatics (API) and is available as an open access document, published in 2003.

TMAs, first introduced in 1998, are collections of hundreds of tissue cores arrayed into a single paraffin histology block (see Figure).



Each TMA block can be sectioned and mounted onto glass slides, producing hundreds of nearly identical slides. TMAs permit investigators to use a single slide to conduct controlled studies on large cohorts of tissues, using a small amount of reagent. The source of tissue is restricted only by its availability in paraffin and ranges from cores of embedded cultured cells to tissues from any higher organism. In a typical TMA study, every TMA core is associated with a rich variety of data elements (image, tissue diagnosis, patient demographics or other biomaterial description, quantified experimental results).

Under ideal circumstances, a single paraffin TMA block can be sectioned into nearly identical glass slides distributed to many different laboratories. These laboratories may use different experimental protocols, and they may capture data using different instruments, different databases, different data architectures, different data elements, and immensely different formats. These laboratories would vastly increase the value of their experimental findings if they merged them with those of the other laboratories that used the same TMA block.

Unfortunately, merging TMA data sets obtained at different laboratories using different information systems was seldom done. A key barrier was the incompatibility of the individual data sets: there simply was no community-based specification for exchanging TMA data. Without such a specification, TMA data files could not be effectively shared or merged among laboratories using different TMA applications.

Since its publication, the TMA Specification has been downloaded from the BioMed Central site 7,424 times (as of April 2007).

At Google Scholar, the TMA Specification paper is listed as cited by 36 papers.

The TMA Specification has been adopted into several TMA database applications.

The common data elements for the specification are freely available.

An implementation of the specification by an NCI-funded research consortium is also available.

-Jules Berman
My book, Principles of Big Data: Preparing, Sharing, and Analyzing Complex Information was published in 2013 by Morgan Kaufmann.



I urge you to explore my book. Google books has prepared a generous preview of the book contents.

tags: big data, metadata, data preparation, data analytics, data repurposing, datamining, data mining

Wednesday, April 4, 2007

Expressing triples in RDF

In yesterday's post, we discussed assertions composed of "triples". The "triples" that form the basis of RDF statements consist of a specified subject, then metadata, then data.

Example: “Jules Berman” “blood glucose level” “85”

Jules Berman (subject)
blood glucose level (metadata or data describing the data that follows)
85 (the data)
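The same anatomy can be sketched as a plain data structure (Python here, purely for illustration; any language with an ordered tuple would do):

```python
# A triple binds a (metadata, data) pair to a specified subject.
triple = ("Jules Berman", "blood glucose level", "85")

subject, metadata, data = triple
print(subject + ": " + metadata + " = " + data)
# prints "Jules Berman: blood glucose level = 85"
```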

There is a specific syntax for expressing triples in RDF. Let us create an RDF triple whose subject is the jpeg image file specified as:

http://www.gwmoore.org/ldip/ldip2103.jpg (subject)

dc:title (metadata)

and

"Normal Lung" (data)

In RDF syntax:

<rdf:Description
rdf:about="http://www.gwmoore.org/ldip/ldip2103.jpg">
<dc:title>Normal Lung</dc:title>
</rdf:Description>

An example of three triples in proper RDF syntax is:

<rdf:Description
rdf:about="http://www.gwmoore.org/ldip/ldip2103.jpg">
<dc:title>Normal Lung</dc:title>
</rdf:Description>
<rdf:Description
rdf:about="http://www.gwmoore.org/ldip/ldip2103.jpg">
<dc:creator>Bill Moore</dc:creator>
</rdf:Description>
<rdf:Description
rdf:about="http://www.gwmoore.org/ldip/ldip2103.jpg">
<dc:date>2006-06-28</dc:date>
</rdf:Description>

RDF permits you to collapse multiple triples that apply to a single subject. The following RDF:Description statement is equivalent to the three prior triples:

<rdf:Description
rdf:about="http://www.gwmoore.org/ldip/ldip2103.jpg">
<dc:title>Normal Lung</dc:title>
<dc:creator>Bill Moore</dc:creator>
<dc:date>2006-06-28</dc:date>
</rdf:Description>

An example of a short but well-formed RDF image specification document is:

<?xml version="1.0"?>
<rdf:RDF
xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
xmlns:dc="http://purl.org/dc/elements/1.1/">
<rdf:Description
rdf:about="http://www.gwmoore.org/ldip/ldip2103.jpg">
<dc:title>Normal Lung</dc:title>
<dc:creator>Bill Moore</dc:creator>
<dc:date>2006-06-28</dc:date>
</rdf:Description>
</rdf:RDF>

The first line tells you that the document is XML. The second line tells you that the XML document is an RDF resource. The third and fourth lines declare the namespaces that are referenced within the document (more about this later). Following that is the RDF statement that we have already seen.
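As a quick check, the document can be parsed to recover its triples. Here is a minimal sketch using Python's standard-library XML module; a real application would use a dedicated RDF parser, but for a simple document like this one, walking the rdf:Description elements is enough:

```python
import xml.etree.ElementTree as ET

# ElementTree qualifies tags and attributes as "{namespace}localname".
RDF_NS = "{http://www.w3.org/1999/02/22-rdf-syntax-ns#}"

rdf_doc = """<?xml version="1.0"?>
<rdf:RDF
xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
xmlns:dc="http://purl.org/dc/elements/1.1/">
<rdf:Description
rdf:about="http://www.gwmoore.org/ldip/ldip2103.jpg">
<dc:title>Normal Lung</dc:title>
<dc:creator>Bill Moore</dc:creator>
<dc:date>2006-06-28</dc:date>
</rdf:Description>
</rdf:RDF>"""

root = ET.fromstring(rdf_doc)

# Each rdf:Description contributes one (subject, metadata, data) triple
# per child element; the subject comes from the rdf:about attribute.
triples = []
for desc in root.findall(RDF_NS + "Description"):
    subject = desc.get(RDF_NS + "about")
    for child in desc:
        triples.append((subject, child.tag, child.text))

for t in triples:
    print(t)
```

The single rdf:Description element above yields three triples, all sharing the jpeg file as their subject.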

Believe it or not, this is 95% of what you need to know to specify your data with RDF. We will provide use-case examples in future blog posts.

-Jules Berman
Science is not a collection of facts. Science is what facts teach us; what we can learn about our universe, and ourselves, by deductive thinking. From observations of the night sky, made without the aid of telescopes, we can deduce that the universe is expanding, that the universe is not infinitely old, and why black holes exist. Without resorting to experimentation or mathematical analysis, we can deduce that gravity is a curvature in space-time, that the particles that compose light have no mass, that there is a theoretical limit to the number of different elements in the universe, and that the earth is billions of years old. Likewise, simple observations on animals tell us much about the migration of continents, the evolutionary relationships among classes of animals, why the nuclei of cells contain our genetic material, why certain animals are long-lived, why the gestation period of humans is 9 months, and why some diseases are rare and other diseases are common. In “Armchair Science”, the reader is confronted with 129 scientific mysteries, in cosmology, particle physics, chemistry, biology, and medicine. Beginning with simple observations, step-by-step analyses guide the reader toward solutions that are sometimes startling, and always entertaining. “Armchair Science” is written for general readers who are curious about science, and who want to sharpen their deductive skills.

Tuesday, April 3, 2007

More Introduction to RDF

As discussed in an earlier post, RDF (Resource Description Framework) is a formal method for describing specified data objects with paired metadata and data.

It is important to understand that in informatics, assertions only have meaning when a pair of metadata and data (the descriptor for the data and the data itself) is assigned to a specific subject.

The "triples" that form the basis of RDF specifications consist of a specified subject, then metadata, then data.

Examples of triples that might be found in a medical dataset:

“Jules Berman” “blood glucose level” “85”
“Mary Smith” “blood glucose level” “90”
“Samuel Rice” “blood glucose level” “200”
“Jules Berman” “eye color” “brown”
“Mary Smith” “eye color” “blue”
“Samuel Rice” “eye color” “green”


Some triples found in a haberdasher's dataset

“Juan Valdez” “hat size” “8”
“Jules Berman” “hat size” “9”
“Homer Simpson” “hat size” “9”
“Homer Simpson” “hat_type” “bowler”

Triples collected from both datasets whose subject is "Jules Berman"

“Jules Berman” “blood glucose level” “85”
“Jules Berman” “eye color” “brown”
“Jules Berman” “hat size” “9”

This is a simple example of data integration over heterogeneous datasets!
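Because every triple carries its own subject, this integration step reduces to a simple filter over the pooled triples. A Python sketch with the example data from the two datasets above:

```python
# Each dataset is just a list of (subject, metadata, data) triples.
medical = [
    ("Jules Berman", "blood glucose level", "85"),
    ("Mary Smith", "blood glucose level", "90"),
    ("Samuel Rice", "blood glucose level", "200"),
    ("Jules Berman", "eye color", "brown"),
]
haberdasher = [
    ("Juan Valdez", "hat size", "8"),
    ("Jules Berman", "hat size", "9"),
    ("Homer Simpson", "hat size", "9"),
]

# Integration over heterogeneous datasets: pool the triples,
# then keep those bound to the subject of interest.
merged = [t for t in medical + haberdasher if t[0] == "Jules Berman"]

for t in merged:
    print(t)
```

No schema negotiation between the two datasets is needed; the subject field alone ties the records together.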

Triples can port their meaning between different databases because they bind described data to a specified subject. This supports data integration of heterogeneous data and facilitates the design of software agents. A software agent, as used here, is a program that can interrogate multiple RDF documents on the web, initiating its own actions based on inferences yielded from retrieved triples.

RDF (Resource Description Framework) is a syntax for writing computer-parsable triples. For RDF to serve as a general method for describing data objects, we need to answer the following four questions:

1. How does the triple convey the unique identity of its subject? In the triple “Jules Berman” “blood glucose level” “85”, the name "Jules Berman" is not unique and may apply to several different people.

2. How do we convey the meaning of metadata terms? Perhaps one person's definition of a metadata term is different from another person's. For example, is "hat size" the diameter of the hat, or the distance from ear to ear on the person who is intended to wear the hat, or a digit selected from a pre-defined scale?

3. How can we constrain the values described by metadata to a specific datatype? Can a person have an eye color of 8? Can a person have an eye color of "chartreuse"?

4. How can we indicate that a unique object is a member of a class and can be described by metadata shared by all the members of a class?

In subsequent blog posts, we'll examine how RDF provides answers to these four questions.

-Jules Berman

tags: data integration, meaning, rdf, specifications, standards, triples, science

Monday, April 2, 2007

Ex Parte Reexamination of Patents

The USPTO (U.S. Patent and Trademark Office) offers a method for the ex parte review of a previously issued patent.

This means that a single party (ex parte means from one party) can ask the USPTO to review a patent to determine whether it was issued in error. Such a procedure would not involve the entity holding the patent.

The USPTO has specific criteria for determining when an ex parte review can be permitted, hinging on "a substantial new question of patentability".

This process could occasionally be an option for SDOs (standards development organizations) that are not directly involved in a patent infringement lawsuit involving the use of their standard. If the SDO wishes to defend the free use of their standard in cases for which an encumbering patent seems trivial or non-original, this might be something to consider.

This post (and any post on this blog site that touches on a legal subject) is not legal advice. This is just a discussion of a published USPTO document.

-Jules Berman

tags: big data, metadata, data preparation, data analytics, data repurposing, datamining, data mining, patent infringement, sdo, specifications, standards, uspto