Specified Life: October 2007

Monday, October 29, 2007

The high level classes in the Developmental Neoplasm ontology

The past several blogs have been devoted to the Developmental Lineage Classification and Taxonomy of Neoplasms.

The rationale of the classification is that tumors inherit key cellular pathways through their developmental lineages. This assertion is supported by decades of morphologic evaluations of tumors. More recently, molecular biological observations have shown that genetic markers and pathways are carried through cell lineage. Tumors grouped by cell lineage may share responses to new chemotherapeutic and chemopreventive agents targeted to specific pathways. If this is true, we can start to develop agents (and combinations of agents) that are effective against groups of neoplasms that share a common developmental lineage.

The taxonomy contains the names of over 5,000 different neoplasms, and about 130,000 synonymous terms. It is the most comprehensive listing of neoplasms in the world, and it is distributed under the GNU Free Documentation License. Download information is available from my website home page.

Here are the top level classes in the cancer ontology:


‹rdfs:Class rdf:ID="Neural_tube_parenchyma"›
  ‹rdfs:subClassOf 
       neo:resource="#Neural_tube"/›
‹/rdfs:Class›

‹rdfs:Class rdf:ID="Sub_coelomic"›
  ‹rdfs:subClassOf 
       neo:resource="#Mesoderm"/›
‹/rdfs:Class›

‹rdfs:Class rdf:ID="Endoderm_or_ectoderm"›
  ‹rdfs:subClassOf 
       neo:resource="#Neoplasm"/›
‹/rdfs:Class›

‹rdfs:Class rdf:ID="Syndrome"›
  ‹rdfs:subClassOf 
       neo:resource="#Unclassified"/›
‹/rdfs:Class›

‹rdfs:Class rdf:ID="Neural_crest"›
  ‹rdfs:subClassOf 
       neo:resource="#Neoplasm"/›
‹/rdfs:Class›

‹rdfs:Class rdf:ID="Germ_cell"›
  ‹rdfs:subClassOf 
       neo:resource="#Neoplasm"/›
‹/rdfs:Class›

‹rdfs:Class rdf:ID="Sub_coelomic_gonadal"›
  ‹rdfs:subClassOf 
       neo:resource="#Sub_coelomic"/›
‹/rdfs:Class›

‹rdfs:Class rdf:ID="Molar"›
  ‹rdfs:subClassOf 
       neo:resource="#Trophectoderm"/›
‹/rdfs:Class›

‹rdfs:Class rdf:ID="Neural_crest_endocrine"›
  ‹rdfs:subClassOf 
       neo:resource="#Neural_crest"/›
‹/rdfs:Class›

‹rdfs:Class rdf:ID="Sub_coelomic_endocrine"›
  ‹rdfs:subClassOf 
       neo:resource="#Sub_coelomic"/›
‹/rdfs:Class›

‹rdfs:Class rdf:ID="Fibrous_tissue"›
  ‹rdfs:subClassOf 
       neo:resource="#Connective_tissue"/›
‹/rdfs:Class›

‹rdfs:Class rdf:ID="Mesoderm_primitive"›
  ‹rdfs:subClassOf 
       neo:resource="#Mesoderm"/›
‹/rdfs:Class›

‹rdfs:Class rdf:ID="Trophectoderm"›
  ‹rdfs:subClassOf 
       neo:resource="#Neoplasm"/›
‹/rdfs:Class›

‹rdfs:Class rdf:ID="Sub_coelomic_nephric"›
  ‹rdfs:subClassOf 
       neo:resource="#Sub_coelomic"/›
‹/rdfs:Class›

‹rdfs:Class rdf:ID="Tumor_classification"›
  ‹rdfs:subClassOf 
       rdfs:resource="http://www.w3.org/2000/01/rdf-schema#Class"/›
‹/rdfs:Class›

‹rdfs:Class rdf:ID="Neoplasm"›
  ‹rdfs:subClassOf 
       rdfs:resource="#Tumor_classification"/›
‹/rdfs:Class›

‹rdfs:Class rdf:ID="Vascular"›
  ‹rdfs:subClassOf 
       neo:resource="#Connective_tissue"/›
‹/rdfs:Class›

‹rdfs:Class rdf:ID="Germ_cell_differentiated"›
  ‹rdfs:subClassOf 
       neo:resource="#Germ_cell"/›
‹/rdfs:Class›

‹rdfs:Class rdf:ID="Endoderm_or_ectoderm_parenchymal"›
  ‹rdfs:subClassOf 
       neo:resource="#Endoderm_or_ectoderm"/›
‹/rdfs:Class›

‹rdfs:Class rdf:ID="Peripheral_nervous_system"›
  ‹rdfs:subClassOf 
       neo:resource="#Neural_crest"/›
‹/rdfs:Class›

‹rdfs:Class rdf:ID="Coelomic_ductal"›
  ‹rdfs:subClassOf 
       neo:resource="#Coelomic"/›
‹/rdfs:Class›

‹rdfs:Class rdf:ID="Coelomic_gonadal"›
  ‹rdfs:subClassOf 
       neo:resource="#Coelomic"/›
‹/rdfs:Class›

‹rdfs:Class rdf:ID="Trophoblast"›
  ‹rdfs:subClassOf 
       neo:resource="#Trophectoderm"/›
‹/rdfs:Class›

‹rdfs:Class rdf:ID="Muscle"›
  ‹rdfs:subClassOf 
       neo:resource="#Connective_tissue"/›
‹/rdfs:Class›

‹rdfs:Class rdf:ID="Mesenchyme"›
  ‹rdfs:subClassOf 
       neo:resource="#Mesoderm"/›
‹/rdfs:Class›

‹rdfs:Class rdf:ID="Neural_crest_primitive"›
  ‹rdfs:subClassOf 
       neo:resource="#Neural_crest"/›
‹/rdfs:Class›

‹rdfs:Class rdf:ID="Neural_crest_ectomesenchymal"›
  ‹rdfs:subClassOf 
       neo:resource="#Neural_crest"/›
‹/rdfs:Class›

‹rdfs:Class rdf:ID="Stage"›
  ‹rdfs:subClassOf 
       neo:resource="#Unclassified"/›
‹/rdfs:Class›

‹rdfs:Class rdf:ID="Neural_tube"›
  ‹rdfs:subClassOf 
       neo:resource="#Neuroectoderm"/›
‹/rdfs:Class›

‹rdfs:Class rdf:ID="Neuroectoderm"›
  ‹rdfs:subClassOf 
       neo:resource="#Neoplasm"/›
‹/rdfs:Class›

‹rdfs:Class rdf:ID="Connective_tissue"›
  ‹rdfs:subClassOf 
       neo:resource="#Mesenchyme"/›
‹/rdfs:Class›

‹rdfs:Class rdf:ID="Neural_crest_melanocytic"›
  ‹rdfs:subClassOf 
       neo:resource="#Neural_crest"/›
‹/rdfs:Class›

‹rdfs:Class rdf:ID="Neural_tube_lining"›
  ‹rdfs:subClassOf 
       neo:resource="#Neural_tube"/›
‹/rdfs:Class›

‹rdfs:Class rdf:ID="Mesoderm"›
  ‹rdfs:subClassOf 
       neo:resource="#Neoplasm"/›
‹/rdfs:Class›

‹rdfs:Class rdf:ID="Unclassified_precancer"›
  ‹rdfs:subClassOf 
       neo:resource="#Unclassified"/›
‹/rdfs:Class›

‹rdfs:Class rdf:ID="Unclassified_cancer"›
  ‹rdfs:subClassOf 
       neo:resource="#Unclassified"/›
‹/rdfs:Class›

‹rdfs:Class rdf:ID="Coelomic"›
  ‹rdfs:subClassOf 
       neo:resource="#Mesoderm"/›
‹/rdfs:Class›

‹rdfs:Class rdf:ID="Bone_cartilage"›
  ‹rdfs:subClassOf 
       neo:resource="#Connective_tissue"/›
‹/rdfs:Class›

‹rdfs:Class rdf:ID="Coelomic_cavity"›
  ‹rdfs:subClassOf 
       neo:resource="#Coelomic"/›
‹/rdfs:Class›

‹rdfs:Class rdf:ID="Heme_lymphoid"›
  ‹rdfs:subClassOf 
       neo:resource="#Mesenchyme"/›
‹/rdfs:Class›

‹rdfs:Class rdf:ID="Adipose_tissue"›
  ‹rdfs:subClassOf 
       neo:resource="#Connective_tissue"/›
‹/rdfs:Class›

‹rdfs:Class rdf:ID="Neuroectoderm_primitive"›
  ‹rdfs:subClassOf 
       neo:resource="#Neuroectoderm"/›
‹/rdfs:Class›

‹rdfs:Class rdf:ID="Endoderm_or_ectoderm_primitive"›
  ‹rdfs:subClassOf 
       neo:resource="#Endoderm_or_ectoderm"/›
‹/rdfs:Class›
                                          
‹rdfs:Class rdf:ID="Unclassified"›
  ‹rdfs:subClassOf 
       neo:resource="#Tumor_classification"/›
‹/rdfs:Class›

‹rdfs:Class rdf:ID="Primordial"›
  ‹rdfs:subClassOf 
       neo:resource="#Germ_cell"/›
‹/rdfs:Class›

‹rdfs:Class rdf:ID="Endoderm_or_ectoderm_surface"›
  ‹rdfs:subClassOf 
       neo:resource="#Endoderm_or_ectoderm"/›
‹/rdfs:Class›

‹rdfs:Class rdf:ID="Endoderm_or_ectoderm_endocrine"›
  ‹rdfs:subClassOf 
       neo:resource="#Endoderm_or_ectoderm"/›
‹/rdfs:Class›
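
Each subclass statement in the listing links a class to its parent, so the lineage of any class can be read off by following parent links until they run out. Here is a minimal Ruby sketch of that walk. The hash copies only a handful of the class/parent pairs shown above, and the names sample_parents and lineage are made up for this illustration; they are not part of the ontology files.

```ruby
# Child => parent pairs copied from a few rdfs:subClassOf statements
# in the listing above (a sample only, not the full ontology).
def sample_parents
  {
    "Neoplasm"           => "Tumor_classification",
    "Mesoderm"           => "Neoplasm",
    "Mesenchyme"         => "Mesoderm",
    "Connective_tissue"  => "Mesenchyme",
    "Fibrous_tissue"     => "Connective_tissue",
    "Neuroectoderm"      => "Neoplasm",
    "Neural_tube"        => "Neuroectoderm",
    "Neural_tube_lining" => "Neural_tube"
  }
end

# Hypothetical helper: returns the chain of classes from the given
# class up to the last class that has no recorded parent.
def lineage(class_name, parent_of = sample_parents)
  chain = [class_name]
  while (parent = parent_of[chain.last])
    # Guard against a malformed table that loops back on itself.
    raise ArgumentError, "cycle detected at #{parent}" if chain.include?(parent)
    chain << parent
  end
  chain
end

puts lineage("Fibrous_tissue").join(" -> ")
```

With the sample pairs, the script prints Fibrous_tissue -> Connective_tissue -> Mesenchyme -> Mesoderm -> Neoplasm -> Tumor_classification. A full version would fill the hash by parsing the RDF file rather than typing the pairs by hand.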


Sunday, October 28, 2007

Developmental Classification of Neoplasms now an RDF Ontology

I am publishing today the first ontology version of the Developmental Lineage Classification and Taxonomy of Neoplasms. It is available for download in several file versions.

The full ontology is a 10 Megabyte RDF file. Note that the file is so large that some browsers may not be able to open the entire file. On my computer, I had no trouble opening the file in my Internet Explorer browser, but the file was too large for my Mozilla browser.
http://www.julesberman.info/neordf.xml

The file was validated using the W3C validator service at http://www.w3.org/rdf/validator/, with a caveat. The full ontology file (10+ Mbytes) was too large for the validator, so I truncated the ontology, leaving out the repetitive list of terms, and validated the truncated file (which contained all of the classes, subclasses, and properties). Then I ran the entire file through an XML parser to verify that it was well-formed. Between them, the two checks cover the RDF structure of the schema and the XML well-formedness of the full file.
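
For readers who want to repeat the well-formedness step, here is a minimal Ruby sketch using REXML, the XML parser that ships with standard Ruby installations. It is not necessarily the parser I used; the method name well_formed? is made up for this example, and the file name is whatever you pass on the command line. The check says nothing about RDF semantics, only whether the XML parses.

```ruby
require "rexml/document"

# Returns true when the XML text parses without error, false otherwise.
# REXML checks well-formedness only; it knows nothing about RDF.
def well_formed?(xml_text)
  REXML::Document.new(xml_text)
  true
rescue REXML::ParseException
  false
end

# Optional command-line use: ruby wellformed.rb neordf.xml
# (the script name is hypothetical).
if __FILE__ == $0 && ARGV[0]
  path = ARGV[0]
  verdict = well_formed?(File.read(path)) ? "well-formed" : "NOT well-formed"
  puts "#{path}: #{verdict}"
end
```

Building the whole document tree in memory is the simplest approach, but for a 10 Megabyte file a streaming parser would probably be lighter on memory.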

The gzipped version of the RDF file (under 1 Megabyte).
http://www.julesberman.info/neorxml.gz

The flat file version, listing each term followed by its lineage (gzipped file).
http://www.julesberman.info/neoself.gz

The plain old XML version, with no RDF semantics (gzipped file). http://www.julesberman.info/neoclxml.gz

The ontology contains several parts:

1. The neoplasm classification proper (as illustrated in the schematic)

2. A listing of cancer terms that will probably never be entered into the proper classification (more about this later)

3. A listing of hyperplasias or hamartomas, some of which will be entered into the proper classification and others of which will remain in class Hyperplasia

4. A listing of precancer terms

5. A listing of syndromes associated with increased risk for cancer.

In this version, there are 5,841 classified types of neoplasms and 130,503 terms representing the 5,841 types of neoplasms.

This represents the largest nomenclature of neoplasms in existence and, with today's publication, the largest formal ontology (in RDF syntax) of neoplasm names.

Over the next few weeks, I'll post additional blogs to further explain the RDF ontology files.

- Jules Berman


Saturday, October 27, 2007

National Cancer Institute Thesaurus

The National Cancer Institute (NCI) Thesaurus is a free medical vocabulary available in OWL format from:

ftp://ftp1.nci.nih.gov/pub/cacore/EVS/NCI_Thesaurus/

It's really quite an impressive document, and there are very few standardized vocabularies that have been prepared as formal ontologies. The creators wisely used the semantics of OWL (Web Ontology Language), a dialect of RDF.

The NCI thesaurus contains terms related to the interests of the NCI and contains the names of many neoplasms.

This vocabulary has been curated for over a decade by in-house ontologists (NCI employees), contractors, and through the use of domain consultants (including some pathologists). It is updated monthly. A lot of money has gone into the development of the NCI Thesaurus, and it is one of the most worked-on vocabularies in the medical field.

The NCI Thesaurus has been reviewed by Barry Smith and colleagues, who found it somewhat lacking.

http://ontology.buffalo.edu/medo/NCIT.pdf


"RESULTS: We found many mistakes and inconsistencies 
with respect to the term-formation principles used, 
the underlying knowledge representation system,
and missing or inappropriately assigned verbal and 
formal definitions."
Ceusters W, Smith B, Goldberg L.
A terminological and ontological analysis of the 
NCI Thesaurus. Methods Inf Med. 2005;44(4):498-507.

My question is, "If the Thesaurus contains many different knowledge domains (medications, general diseases, neoplasms, etc.) how can it adequately cover all of its constituent domains?" In the neoplasm domain, it is missing many thousands of names of neoplasms. The terminology may be sufficient for its intended purpose (meeting the needs of the NCI community), but because the terminology is not comprehensive, the NCI Thesaurus will not necessarily serve those who want a thesaurus that comes close to including the names of ALL neoplasms.

Also, there doesn't seem to be any single organizing principle for the neoplasm domain. Some neoplasms are subclassed by their anatomic site (e.g. urinary tract neoplasm). Others are subclassed by their tissue type (e.g. soft tissue neoplasm). And so on. This is allowable under an ontology, so long as the ontology maintains consistency and competence (ability to answer questions about the members of classes). But I wonder if this is the best way of organizing tumors. Of course, I'm deeply biased. The Developmental Lineage Classification and Taxonomy of Neoplasms has a single organizing principle.

The NCI Thesaurus is an impressive piece of work and definitely worth looking over.

tags: biomedical informatics, cancer, classification, nomenclature, thesaurus, vocabulary, ontology, rare diseases, orphan drugs, genetics of disease, pathology, common diseases, complex diseases

Thursday, October 25, 2007

New schema for the Neoplasm Classification

The Developmental Lineage Classification and Taxonomy of Neoplasms first came out in 2003, and I've been making revisions and updates since then. Most of the work has involved adding new names of neoplasms, but this month I've made a change to the basic organization of the classification.

The schema is summarized here:

There are now six major classes under the root class, Neoplasm:


Endoderm/Ectoderm
Mesoderm
Neuroectoderm
Neural Crest
Germ cell
Trophectoderm

Every neoplasm falls under one of the six major classes.

The rationale of the classification is that tumors inherit key cellular pathways through their developmental lineages. This assertion is supported by decades of morphologic evaluations of tumors. More recently, molecular biological observations have shown that genetic markers and pathways are carried through cell lineage. Tumors grouped by cell lineage may share responses to new chemotherapeutic and chemopreventive agents targeted to specific pathways. If this is true, we can start to develop agents (and combinations of agents) that are effective against groups of neoplasms that share a common developmental lineage.

This theme has been developed in several of my early papers. The most popular paper, which has had many thousands of downloads, is:

Tumor classification: molecular analysis meets Aristotle

The complete classified taxonomy is available as a gzipped XML file at:

http://www.julesberman.info/neoclxml.gz

The taxonomy contains the names of over 5,000 different neoplasms, and about 130,000 synonymous terms. It is the most comprehensive listing of neoplasms in the world, and it is distributed under the GNU Free Documentation License.

-Jules J. Berman


Wednesday, October 24, 2007

Merging and splitting pdf images while preserving image annotations

A nice feature of pdf files is the ability to merge files or even parts of files. In the past week, we have discussed the free command-line utility, pdftk. The pdftk utility manages pdf files and is available at:

http://www.accesspdf.com/pdftk

The pdftk utility permits you to easily merge collections of pdf files. If you have converted your images to pdf files (something easily done with Ruby and discussed in several recent blog posts), you can make a large pdf file with each page of the file corresponding to a different image. If you are a pathologist who wants to send a set of images for consultation to another pathologist, you can combine all of the pertinent images into a single pdf file and send it to a colleague.

The message stressed in almost every entry of this Specified Life blog is that data has no value unless it is fully described. Furthermore, the description of the data must be bound to the data object. There are lots of ways of achieving this goal. In the case of pdf images, descriptive data can be added to the image header or descriptive data can be added to the image as an attachment file.

What happens to the attachment files for pdf images when you combine images into a merged document? Do you lose your attachments?

No. The descriptive attachments are preserved and can be reclaimed from the merged file. There are actually several different ways of reclaiming attached files from a merged pdf document. In this blog, we'll describe just one method, using pdftk.

First, we'll prepare two pdf image files for which textual attachments have been added (technique discussed in a prior blog). We'll put these into the c:\pdftk subdirectory, where we keep our pdftk exe files.

10/23/2007 08:00 PM 30,101 out_att.pdf
10/24/2007 07:07 AM 125,845 submu_at.pdf

We'll combine all of our pdf image files in the subdirectory using pdftk's cat command.

C:\pdftk>pdftk *.pdf cat output combined.pdf

This produces a multipage file, combined.pdf, with each image occupying a page of the file.

Next, we'll split the file, combined.pdf into pages, with each page representing a new pdf file containing a single image (using pdftk's burst command).

C:\pdftk>pdftk combined.pdf burst

Now, when we look at all of the .pdf files in our subdirectory, we see the two files that we started with (out_att.pdf and submu_at.pdf) plus the file that merges the two image files, combined.pdf, plus the two files that represent the split pieces of the file combined.pdf, produced by the burst command (pg_0001.pdf and pg_0002.pdf).


C:\pdftk>dir *.pdf
 Volume in drive C has no label.
 Volume Serial Number is 504C-8FB6

 Directory of C:\pdftk

10/24/2007  07:09 AM           155,685 combined.pdf
10/23/2007  08:00 PM            30,101 out_att.pdf
10/24/2007  07:11 AM            30,251 pg_0001.pdf
10/24/2007  07:11 AM           125,990 pg_0002.pdf
10/24/2007  07:07 AM           125,845 submu_at.pdf
               5 File(s)        467,872 bytes
               0 Dir(s)  30,727,413,760 bytes free

Can we reclaim the attachments for these pdf figures? Yes.

We can retrieve the attachment file by using pdftk's unpack_files command on any of the burst files.

Here, we extract the attachment from pg_0002.pdf:

C:\pdftk>pdftk pg_0002.pdf unpack_files output

The new file produced by unpacking the attachment is submu_at.txt, the textual attachment originally added to the file submu_at.pdf.


C:\pdftk>dir *.txt
 Volume in drive C has no label.
 Volume Serial Number is 504C-8FB6

 Directory of C:\pdftk
10/24/2007  07:18 AM               190 SUBMU_AT.TXT

We can read the contents of submu_at.txt with the DOS type command.


C:\pdftk>type submu_at.txt
This is the descriptive text that I have attached 
to an image of submucosa from Gray's Anatomy, and 
copied from the Wikipedia Commons at: 
http://en.wikipedia.org/wiki/Image:Gray1033.png

This seems like a lot of work, and it is. Now that we know the individual steps for creating pdf image attachment files, merging pdf images, reclaiming the attachments, and displaying the image and its description, we can write scripts (in Perl, Python, or Ruby) that automate all of these steps (by making system calls to pdftk) and that scale up to more ambitious efforts.
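
As a starting point for such a script, here is a minimal Ruby sketch that wraps the three pdftk calls used in this post (cat, burst, and unpack_files). The method names and the script name are made up for this example. The argument lists simply mirror the command lines shown above, so the sketch assumes pdftk is on the path and behaves as it did in those examples.

```ruby
# Each helper returns the argument list for one pdftk call, mirroring
# the command lines used in this post. Arrays (rather than one shell
# string) avoid quoting problems with unusual file names.
def merge_command(inputs, output)
  ["pdftk", *inputs, "cat", "output", output]
end

def burst_command(input)
  ["pdftk", input, "burst"]
end

def unpack_command(input)
  # Same form as the post: "unpack_files output" with no directory named.
  ["pdftk", input, "unpack_files", "output"]
end

# Runs one command; system returns true only on a zero exit status.
def run_pdftk(command)
  system(*command) or raise "command failed: #{command.join(' ')}"
end

# Usage (script name is hypothetical):
#   ruby pdf_roundtrip.rb out_att.pdf submu_at.pdf
unless ARGV.empty?
  run_pdftk merge_command(ARGV, "combined.pdf")
  run_pdftk burst_command("combined.pdf")
  Dir.glob("pg_*.pdf").sort.each { |page| run_pdftk unpack_command(page) }
end
```

The input files are named explicitly because passing arguments as a list bypasses the shell, so a wildcard such as *.pdf would not be expanded.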

Jules J. Berman


Tuesday, October 23, 2007

Adding an annotative text file to a pdf image

Earlier this week (10/21/2007), I described how you can add textual key/value image descriptors to the header of a pdf file.

If you like, you can add entire documents to a pdf file using the pdftk utility.

I described the pdftk utility in an earlier blog this week, but you can go directly to the pdftk site for a free download:

http://www.accesspdf.com/pdftk

We'll work from the pdftk subdirectory in Windows, and we'll use the pdf image out.pdf.

As an attachment, we'll add the neo2.dot file, which happens to contain the GraphViz specification for the out.pdf image (we could have used any file).

We put pdftk in the c:\pdftk\ subdirectory.

We enter the command line:

C:\pdftk>pdftk out.pdf attach_files neo2.dot to_page 1 output out_att.pdf

The pdftk utility will attach the text file, neo2.dot, to page 1 of the pdf image file, out.pdf, creating a new file, out_att.pdf. When we view the out_att.pdf file, it shows the same image as the out.pdf file, but the file contains a hidden attachment file, neo2.dot.

We can now delete the neo2.dot file:

C:\pdftk>del neo2.dot

We can extract the neo2.dot file from the out_att.pdf file using the following pdftk command line.

C:\pdftk>pdftk out_att.pdf unpack_files output

The default output file is the attachment file, and it will have its original name, neo2.dot.

We can verify by printing out the contents of neo2.dot:


C:\pdftk>type neo2.dot
digraph G {
 size="10,16";
 ranksep="1.90";
 node [style=filled color=gray65];
 Neoplasm [label="Neoplasm"];
 node [style=filled color=lightgray];
 EndodermEctoderm
    [label="Endoderm\/\nEctoderm"];
 NeuralCrest [label="Neural Crest"];
 GermCell [label="Germ cell"];
 Neoplasm -> EndodermEctoderm;
 Neoplasm -> Mesoderm;
 Neoplasm -> GermCell;
 Neoplasm -> Trophectoderm;
 Neoplasm -> Neuroectoderm;
 Neoplasm -> NeuralCrest;
and so on

In summary, any image can be converted to a PDF file, and the descriptive text for the image can be added as an attachment to the file. You can send the file to a colleague knowing that the image file conveys textual descriptors that you have provided. At any time, the textual descriptors can be extracted from the pdf file.

- Jules Berman

Monday, October 22, 2007

Adding and extracting comments to a Tiff image, with Ruby

This blog is part of a series describing methods to annotate images. The purpose of image annotation is to convey important (and often necessary) textual information inside the header of your image.

The basic utilities for image display/manipulation in Ruby are posted.

For this blog, we will use the .tif version of a public domain image of submucosa selected from Gray's Anatomy and downloaded from the Wikipedia Commons.

The Ruby script tifadd.rb adds a simple comment, "Hello world", to the tif file submu_bw.tif.


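#!/usr/local/bin/ruby
# tifadd.rb: write a text comment into the header of a TIFF image.
require 'RMagick'   # newer RMagick releases use require 'rmagick'
include Magick
tissue = ImageList.new("submu_bw.tif")
# Set the Comment property on the current (only) image in the list.
tissue.cur_image[:Comment] = "Hello world"
# Copy the annotated image and save it under a new name.
tissue_copy = tissue.cur_image.copy
tissue_copy.write("submu_2.tif")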

The Ruby script is invoked from the command line.


c:\>ruby tifadd.rb

That's all there is to it. The comment can be extracted from the new .tif file, submu_2.tif, with the Ruby script tifout.rb.


#!/usr/local/bin/ruby
require 'RMagick'
include Magick
tissue = ImageList.new("submu_2.tif")
print tissue.properties['Comment']
exit

Here is the output.


c:\ftp>ruby tifout.rb
Hello world

It's very simple, but this technique allows you to start annotating your .tif files with text.

- Jules Berman

Sunday, October 21, 2007

Annotating a PDF image

I have been writing a series of posts on image annotation. Basically, images should convey textual information that describes the image and that describes the image file (the title of the image, who made the image, who can use the image file, etc.).

Prior posts explained how to do this with jpg and png images. If you transform a png image into a jpg image, text in the comment block in the png image header is preserved in the comment block of the jpg image header. However, if you transform a png image into a pdf image, the comment block is lost.

In this post, I'll show how to fetch the textual descriptors from a pdf file, and how to add your own textual descriptors to the pdf file. The methodology is fully described in Sid Steward's excellent book, "PDF Hacks: 100 Industrial-Strength Tips & Tools."

First, just about any image, in any popular format, can be converted into a PDF document. Here's a Ruby script that takes the png image, neo1.png, and converts it to the pdf file, out.pdf. More information about Ruby and RMagick is available.



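#!/usr/local/bin/ruby
# Convert neo1.png to a pdf file; the output format is chosen
# from the file extension passed to write.
require 'RMagick'   # newer RMagick releases use require 'rmagick'
include Magick
walnut = ImageList.new("neo1.png")
walnut_copy = walnut.cur_image.copy
walnut_copy.write("out.pdf")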

Now, if you want to see the text header in the pdf file, get pdftk (free).

http://www.accesspdf.com/pdftk/

If you have Windows, you can download the binary exe. For example:

pdftk-1.12.exe.zip (1,470,142 bytes)

Unzip the file and put the output into c:\pdftk

It will look something like:



C:\pdftk>dir
..
11/08/2004  03:11 PM            39,049 pdftk.1.html
11/08/2004  03:11 PM            17,878 pdftk.1.txt
11/08/2004  03:34 PM         1,489,920 pdftk.exe

Put your pdf image file in the c:\pdftk directory

Create an output file, outd.txt from your out.pdf image, with the following command line:



C:\pdftk>pdftk out.pdf dump_data output outd.txt

Then look at the outd.txt file:


C:\pdftk>type outd.txt
InfoKey: Title
InfoValue: out.pdf
InfoKey: Producer
InfoValue: ImageMagick 6.2.9 08/11/06 Q8 
              http://www.imagemagick.org
InfoKey: ModDate
InfoValue: D:20071021080307
InfoKey: CreationDate
InfoValue: D:20071021080307
NumberOfPages: 1

Every pdf file has important textual descriptors, including the Creation date.

Now, let's go into a text editor and add values to the outd.txt file. We add four lines (bottom) that describe the copyright and the usage for the image.



InfoKey: Title
InfoValue: out.pdf
InfoKey: Producer
InfoValue: ImageMagick 6.2.9 08/11/06 Q8 
              http://www.imagemagick.org
InfoKey: ModDate
InfoValue: D:20071021080307
InfoKey: CreationDate
InfoValue: D:20071021080307
InfoKey: Copyright
InfoValue: Copyright (C) 2007 Jules J. Berman
InfoKey: Usage
InfoValue: GNU Free Documentation License
NumberOfPages: 1

Now, let's load this revised descriptor file back into the pdf image.



C:\pdftk>pdftk out.pdf update_info outd.txt output outd.pdf

We now have a new pdf file, outd.pdf, with the revised descriptor list included in the pdf file header. Let's check. Write a dump_data command line to extract the text descriptors of the outd.pdf file.



C:\pdftk>pdftk outd.pdf dump_data output outd2.txt

Then print out the outd2.txt file containing the text descriptors for outd.pdf



C:\pdftk>type outd2.txt
InfoKey: Title
InfoValue: out.pdf
InfoKey: Producer
InfoValue: ImageMagick 6.2.9 08/11/06 Q8 
              http://www.imagemagick.org
InfoKey: Copyright
InfoValue: Copyright (C) 2007 Jules J. Berman
InfoKey: ModDate
InfoValue: D:20071021080307
InfoKey: Usage
InfoValue: GNU Free Documentation License
InfoKey: CreationDate
InfoValue: D:20071021080307
NumberOfPages: 1

Notice that the InfoKey/InfoValue pairs are rearranged, but the information that we added (copyright statement and usage statement) is included.
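
Because the pairs come back in a different order, comparing the two dump files line by line is not very helpful. Here is a minimal Ruby sketch that reads a dump into a hash, so that two dumps can be compared by key. The method name parse_info is made up for this example, and the parser assumes the layout shown above, with each InfoKey line followed directly by its InfoValue line.

```ruby
# Parses pdftk dump_data text into a hash of InfoKey => InfoValue.
# Lines that start with neither prefix (NumberOfPages, or a value
# that wraps onto a second line) are ignored.
def parse_info(text)
  info = {}
  key = nil
  text.each_line do |line|
    case line
    when /\AInfoKey:\s*(.*?)\s*\z/
      key = $1
    when /\AInfoValue:\s*(.*?)\s*\z/
      info[key] = $1 if key
      key = nil
    end
  end
  info
end

# Optional command-line use (script name is hypothetical):
#   ruby infodiff.rb outd.txt outd2.txt
if ARGV.length == 2
  first  = parse_info(File.read(ARGV[0]))
  second = parse_info(File.read(ARGV[1]))
  puts "Keys only in #{ARGV[1]}: #{(second.keys - first.keys).join(', ')}"
  puts "Keys only in #{ARGV[0]}: #{(first.keys - second.keys).join(', ')}"
end
```

Run against a dump made before the update and one made after it, the script should list Copyright and Usage as the keys found only in the second file, whatever order pdftk wrote them in.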

So, how do you preserve header information when you transform a png or a jpg file to a pdf file? I'm sure you can figure this out for yourself, but just for completeness' sake, I hope to write some future blogs that provide techniques and examples.
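
One possible route, offered only as a sketch: read the comment from the png or jpg header, rewrite it as InfoKey/InfoValue lines, and hand those lines to pdftk's update_info. The Ruby below covers the middle step. The method name info_records is made up, the use of "Comment" as an InfoKey is an assumption (the example above shows that custom keys such as Copyright and Usage are accepted), and collapsing a multi-line comment onto one line is my own choice, since each InfoValue in the dump occupies a single line.

```ruby
# Turns a hash of descriptors into InfoKey/InfoValue text, following
# the layout of the dump_data output shown in this post.
def info_records(descriptors)
  descriptors.map do |key, value|
    # Collapse newlines and runs of spaces so the value fits on one line.
    one_line = value.to_s.split.join(" ")
    "InfoKey: #{key}\nInfoValue: #{one_line}\n"
  end.join
end

# Example comment text; in practice it would be read from the image header.
comment = " Developmental Lineage\n Classification\n of Neoplasms"
puts info_records("Comment" => comment)
```

The printed text could be saved to a file and passed to pdftk with the same update_info command line used earlier in this post.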

- Jules Berman

Saturday, October 20, 2007

More Ruby image header and image format tricks

This blog is just an elaboration on an earlier blog. Again, it uses RMagick and ImageMagick. Details for using these within Ruby have been posted.

Here's a script, pdfadd.rb, that takes a png image file and creates a jpg image file with a text comment.



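#!/usr/local/bin/ruby
# pdfadd.rb: read a png, attach a multi-line comment, and save as jpg.
require 'RMagick'   # newer RMagick releases use require 'rmagick'
include Magick
walnut = ImageList.new("neo1.png")
# The string literal spans several lines, so the stored comment
# keeps the embedded newlines and leading spaces.
walnut.cur_image[:Comment] =
 " Developmental Lineage
 Classification
 of Neoplasms
 Copyright (C) 2007
 by Jules J. Berman
 and distributed under
 the GNU Free
 Documentation License"
walnut_copy = walnut.cur_image.copy
walnut_copy.write("out.jpg")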

Once the comment has been put into the jpg file, we can extract it with another Ruby script, pdf2.rb.



#!/usr/local/bin/ruby
require 'RMagick'
include Magick
walnut = ImageList.new("out.jpg")
print walnut.properties['Comment']
exit

The output looks like this:



c:\ftp>ruby pdf2.rb
 Developmental Lineage
 Classification
 of Neoplasms
 Copyright (C) 2007
 by Jules J. Berman
 and distributed under
 the GNU Free
 Documentation License

We can also directly transform one image type into another. Here, we will transform a png image into a pdf image.



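#!/usr/local/bin/ruby
# Transform a png image into a pdf image; the target format is
# taken from the extension of the output file name.
require 'RMagick'   # newer RMagick releases use require 'rmagick'
include Magick
walnut = ImageList.new("neo1.png")
walnut_copy = walnut.cur_image.copy
walnut_copy.write("out.pdf")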

As a caveat, when you transform an image between different formats, your comment text may be lost if the receiving format doesn't recognize the attribute object. So you can transfer Comment text from png to jpg, but you can't directly transfer the comment over to pdf.

- Jules Berman

Friday, October 19, 2007

Specifying a classification with GraphViz

I made this graph with GraphViz (a freely available software application).

In my prior post, I used this same figure, which displays the entire Neoplasm Classification Schema, in a graphic form.

Here's how to use GraphViz to display a classification or ontology.

The GraphViz download site is:

http://www.graphviz.org/Download.php

Windows users can download graphviz-2.14.1.exe (5,614,329 bytes).

You can install the software by running the .exe file.

GraphViz has its own language, in which you list the relationships among the different classes, and then it builds a graphic view of the classification.

GraphViz has several sub-applications: dot, fdp, twopi, neato, and circo. The twopi application, which I used, creates graphs that have a radial layout.



digraph G {
 size="10,16";
 ranksep="1.75";
 node [style=filled color=gray65];
 Neoplasm [label="Neoplasm"];
 node [style=filled color=lightgray];
 EndodermEctoderm 
    [label="Endoderm\/\nEctoderm"];
 NeuralCrest [label="Neural Crest"];      
 GermCell [label="Germ cell"];
 Neoplasm -> EndodermEctoderm;
 Neoplasm -> Mesoderm;
 Neoplasm -> GermCell;
 Neoplasm -> Trophectoderm;
 Neoplasm -> Neuroectoderm;
 Neoplasm -> NeuralCrest;
 node [style=filled color=gray95];
 Trophectoderm -> Molar;
 Trophectoderm -> Trophoblast;
 EndodermEctoderm -> Odontogenic;
 EndodermEctodermPrimitive 
    [label="Endoderm\/Ectoderm\nPrimitive"];
 EndodermEctoderm -> EndodermEctodermPrimitive;
 Endocrine 
    [label="Endoderm/Ectoderm\nEndocrine"];
 EndodermEctoderm -> Endocrine;
 EndodermEctoderm -> Parenchymal;
 Odontogenic 
    [label="Endoderm/Ectoderm\nOdontogenic"];
 EndodermEctoderm -> Surface;
 MesodermPrimitive 
    [label="Mesoderm\nPrimitive"];
 Mesoderm -> MesodermPrimitive;
 Mesoderm -> Subcoelomic;
 Mesoderm -> Coelomic;
 NeuroectodermPrimitive 
   [label="Neuroectoderm\nPrimitive"];
 NeuroectodermNeuralTube 
   [label="Central Nervous\nSystem"];
 Neuroectoderm -> NeuroectodermPrimitive;
 Neuroectoderm -> NeuroectodermNeuralTube;
 NeuralCrestMelanocytic 
   [label="Melanocytic"];
 NeuralCrestPrimitive 
   [label="Neural Crest\nPrimitive"];
 NeuralCrestEndocrine 
   [label="Neural Crest\nEndocrine"];
 PeripheralNervousSystem 
   [label="Peripheral\nNervous System"];
 NeuralCrestOdontogenic 
   [label="Neural Crest\nOdontogenic"];
 NeuralCrest -> NeuralCrestPrimitive;
 NeuralCrest -> PeripheralNervousSystem;
 NeuralCrest -> NeuralCrestEndocrine;
 NeuralCrest -> NeuralCrestMelanocytic;
 NeuralCrest -> NeuralCrestOdontogenic;
 GermCell -> Differentiated;
 GermCell -> Primordial;
}

You create the image, neo2.png, from the neo2.dot specification (above) by invoking the twopi sub-application on the command line.

c:\ftp>twopi -Tpng neo2.dot -o neo2.png
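For a large classification, typing every edge by hand gets tedious. Here is a minimal Ruby sketch, not the script I used for the figure, that builds the edge list from a hash of parent classes and their child classes. The method name classification_to_dot and the three-class excerpt are assumptions made for illustration.

```ruby
# Hypothetical helper (not part of GraphViz): turn a hash of
# parent => [children] into the text of a GraphViz digraph.
def classification_to_dot(tree, graph_name = "G")
  lines = ["digraph #{graph_name} {"]
  tree.each do |parent, children|
    children.each do |child|
      # One directed edge per parent/child relationship.
      lines << "  #{parent} -> #{child};"
    end
  end
  lines << "}"
  lines.join("\n") + "\n"
end

# A small excerpt of the classification, for illustration only.
tree = {
  "Neoplasm"      => ["Mesoderm", "GermCell", "Trophectoderm"],
  "Trophectoderm" => ["Molar", "Trophoblast"],
  "GermCell"      => ["Differentiated", "Primordial"]
}

dot_text = classification_to_dot(tree)
puts dot_text
```

If you save the printed text to a .dot file, you can pass it to twopi with the same command line shown above.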

-Jules Berman tags: classification, directed graph, graph, ontology

Adding a comment to a PNG or JPG header with Ruby and RMagick

Whenever you create and distribute an image, you should consider putting information in the header, such as the image title and your name and usage restrictions (if any). If you're putting the image into the public domain, you should specify that the image is a public domain image.

These issues are discussed in depth in my web page, Implementing an RDF Schema for Pathology Images.

The header of an image is a text block that is not visible when the image is viewed, but which can easily be extracted from the image.

Here are two short Ruby scripts. The first script enters a comment into an image header. The second script extracts and displays the comment.


script pdfadd.rb
#!/usr/local/bin/ruby
require 'RMagick'
include Magick
walnut = ImageList.new("neo1.png")
walnut.cur_image[:Comment] = "Developmental Lineage
                            Classification
                            of Neoplasms
                            Copyright (C) 2007
                            by Jules J. Berman
                            and distributed under
                            the GNU Free
                            Documentation License"
# Copy the current image (with its comment) and write it out as a jpg.
walnut_copy = walnut.cur_image.copy
walnut_copy.write("out.jpg")
exit


script pdf2.rb
#!/usr/local/bin/ruby
require 'RMagick'
include Magick
walnut = ImageList.new("out.jpg")
walnut.properties{|name, value|
                 print "#{name} #{value}\n"}
exit

These scripts require RMagick, and instructions for acquiring RMagick are available.

The first script, pdfadd.rb, adds a comment to an image and copies the png image to a jpg image.

The second script, pdf2.rb, extracts and prints the comment from the jpeg image.

The output of the second script is:



C:\ftp>pdf2.rb
Comment Developmental Lineage
                            Classification
                            of Neoplasms
                            Copyright (C) 2007
                            by Jules J. Berman
                            and distributed under
                            the GNU Free
                            Documentation License
JPEG-Colorspace 1
JPEG-Sampling-factors 1x1
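The line breaks and indentation in the printed comment come straight from the multi-line string literal in pdfadd.rb. If you would rather store the comment as a single line, here is a minimal sketch in plain Ruby (no RMagick needed) that collapses the whitespace first. The variable names raw_comment and clean_comment are mine, and the comment text is shortened for illustration.

```ruby
# A multi-line string literal keeps its newlines and indentation.
raw_comment = "Developmental Lineage
                            Classification
                            of Neoplasms"

# split with no arguments breaks on any run of whitespace
# (spaces and newlines); join puts single spaces back.
clean_comment = raw_comment.split.join(" ")

puts clean_comment
```

The cleaned string could then be assigned to the Comment attribute in the same way that pdfadd.rb assigns the original string.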

The image, neo1.png, is:

I urge you to read more about my book. There's a generous preview of the book at the Google Books site. If you like the book, please ask your librarian to purchase a copy for your library or reading room.

- Jules J. Berman, Ph.D., M.D. tags: common disease, orphan disease, orphan drugs, rare disease, disease genetics

Wednesday, October 17, 2007

Updates to my web site

I've recently made updates to several old files on my website.

They are:

2003 Letter to Human Pathology regarding precancers

2004 Letter to Human Pathology regarding precancers

2004 Editorial to Am J Clinical Pathology regarding data sharing in pathology

2001 List of 12,000+ medical abbreviations

2003 Neoplasm Classification gzipped file

- Jules Berman

Tuesday, October 9, 2007

Ruby scripts from Ruby Programming for Medicine and Biology

For those interested, here is a list of Ruby scripts that are included and described in Ruby Programming for Medicine and Biology, my book that was published last month (September, 2007).

SCRIPTS IN RUBY PROGRAMMING FOR MEDICINE AND BIOLOGY

1.3.1. Getinput.rb retrieves a line of keyboarded text.

1.4.1. Grow.rb simulates six generations of bacterial growth.

1.4.3. Grow2.rb uses explicit Ruby objects and statements.

1.5.1. Lowclass.rb provides syntax for user-created classes.

1.5.3. Noclass.rb requires an external Person class definition.

1.5.4. Person_class_file.rb, a class library for script noclass.rb.

2.7.1. Combo.rb parses an array into all possible ordered subarrays.

2.8.1. Hash.rb creates and displays key/value pairs for a Hash instance object.

2.8.4. Neohash.rb creates three hash instances for the Neoplasm Classification.

2.10.1. Glob.rb demonstrates the Dir class glob method.

2.10.3. Dirlist.rb lists the files in the current directory.

2.12.1. Time.rb measures the length of time for any process.

3.4.2. Mod1.rb defines and includes a simple module.

3.4.3. Mod2.rb calls a module with the scope operator.

3.4.4. Mod3.rb embeds a module within a class.

6.2.1. Readsome.rb reads the first 20 lines of the MRCONSO file.

6.3.2. Zipf.rb prints the number of occurrences of words in a string.

6.3.4. Zipf2.rb creates a Zipf distribution of the words in OMIM.

6.4.1. Snom_get.rb extracts SNOMED-CT terms from UMLS.

6.5.1. Disease.rb collects SNOMED-CT diseases from UMLS.

6.6.1. Neosdbm.rb creates three persistent database objects.

6.7.1. Sdbmget.rb retrieves data from a persistent object.

7.3.1. Sentence.rb, a simple sentence parser.

7.6.1. Search.rb searches through any file for lines matching a regular expression.

7.7.1. Pubemail.rb extracts e-mail addresses from a PubMed search.

8.3.1. Base64.rb encodes strings in Base64 notation.

8.4.1. Dircopy.rb copies files from one directory to another.

8.5.1. Dcm2jpg.rb converts a DICOM file into a jpeg file.

8.5.3. Dcmsplit.rb converts a DICOM file into a jpeg and a text file.

8.6.1. Jpeg_add.rb inserts textual information into a jpeg image.

9.2.1. Concord.rb creates a concordance.

9.3.1. Indexer.rb creates an index.

10.2.1. Haystack.rb performs a binary search on a file.

10.3.2. Tinysort.rb sorts the lines of a file.

10.3.4. Bigsort.rb sorts large files quickly.

10.4.1. Anatomy.rb extracts SQL data from the Foundational Model of Anatomy.

10.5.2. Alldata.rb sums the census districts to yield the total U.S. population.

11.2.1. Scrubit.rb scrubs and deidentifies any input line.

12.2.2. Autocode.rb provides nomenclature terms and codes for an input sentence.

12.3.1. Fastcode.rb improves performance compared with autocode.rb.

12.4.3. Icd.rb collects ICD10AM codes from the UMLS Metathesaurus.

12.5.1. Seer.rb determines the occurrences in the U.S. of tumor types found in the SEER public-use data files.

13.5.1. Fibo.rb computes the first twenty elements of the Fibonacci series.

13.6.1. Mean.rb computes the mean from an array of numbers.

13.7.1. Std_dev.rb computes the standard deviation for an array of numbers.

13.8.1. Randtest.rb simulates 600,000 casts of the die.

13.10.1. Error.rb uses resampling to simulate runs of errors.

14.6.2. Thresh.rb divides a text file into two threshold files.

14.6.5. Threshrv.rb computes original file from two threshold files.

15.2.2. Neopull.rb searches a server file for a web client query.

15.3.1. Neosafe.rb improves the security of neopull.rb.

17.3.1. Biohack.rb converts a gene sequence into a protein sequence.

18.14.2. Rdf3.rb extracts triples from an RDF document.

18.19.2. Jpg2b64.rb inserts a Base64 jpeg image into an RDF file.

19.2.2. Ancestor.rb determines the ancestor lineage for organisms.

19.2.4. Class_lineage.rb determines ancestral lineage.

19.3.1. Neoself.rb provides tag hierarchy for XML file.

20.5.1. Conflict.rb overrides a class assignment.

-Jules Berman tags: bioinformatics, biomedical informatics, medical informatics, Ruby, ruby language, Ruby programming, scripts