Sunday, October 21, 2007

Annotating a PDF image

I have been writing a series of posts on image annotation. Basically, images should convey textual information that describes the image and that describes the image file (the title of the image, who made the image, who can use the image file, etc.).

Prior posts explained how to do this with jpg and png images. If you transform a png image into a jpg image, text in the comment block in the png image header is preserved in the comment block of the jpg image header. However, if you transform a png image into a pdf image, the comment block is lost.

In this post, I'll show how to fetch the textual descriptors from a pdf file, and how to add your own textual descriptors to the pdf file. The methodology is fully described in Sid Steward's excellent book, "PDF Hacks: 100 Industrial-Strength Tips & Tools."

First, just about any image in any popular format can be converted into a PDF document. Here's a Ruby script that takes the png image, neo1.png, and converts it to the pdf file, out.pdf. More information about Ruby and RMagick is available online.


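#!/usr/local/bin/ruby
# Convert neo1.png to a one-page pdf file, out.pdf, using RMagick.
require 'RMagick'   # newer RMagick releases use: require 'rmagick'
include Magick

# Load the png image and take a copy of the current frame.
walnut = ImageList.new("neo1.png")
walnut_copy = walnut.cur_image.copy

# The .pdf extension tells ImageMagick to write a pdf file.
walnut_copy.write("out.pdf")
exit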


Now, if you want to see the text header in the pdf file, get pdftk (free).

http://www.accesspdf.com/pdftk/

If you have Windows, you can download the binary exe. For example:

pdftk-1.12.exe.zip (1,470,142 bytes)

Unzip the file and put the output into c:\pdftk

It will look something like:


C:\pdftk>dir
..
11/08/2004 03:11 PM 39,049 pdftk.1.html
11/08/2004 03:11 PM 17,878 pdftk.1.txt
11/08/2004 03:34 PM 1,489,920 pdftk.exe


Put your pdf image file in the c:\pdftk directory.

Create an output file, outd.txt, from your out.pdf image with the following command line:


C:\pdftk>pdftk out.pdf dump_data output outd.txt

Then look at the outd.txt file:

C:\pdftk>type outd.txt
InfoKey: Title
InfoValue: out.pdf
InfoKey: Producer
InfoValue: ImageMagick 6.2.9 08/11/06 Q8 http://www.imagemagick.org
InfoKey: ModDate
InfoValue: D:20071021080307
InfoKey: CreationDate
InfoValue: D:20071021080307
NumberOfPages: 1


Even a freshly converted pdf file carries textual descriptors, including the creation date.
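
If you'd rather read the descriptors from a script than by eye, here's a minimal Ruby sketch that turns the dump_data text into a hash. The helper name parse_info is my own invention (it is not part of pdftk), and the sketch assumes each InfoKey line is immediately followed by its InfoValue line, as in the listing above.

```ruby
# Parse the text produced by "pdftk ... dump_data" into a Hash.
# parse_info is a hypothetical helper name, not part of pdftk.
def parse_info(dump_text)
  info = {}
  key = nil
  dump_text.each_line do |line|
    line = line.chomp
    if line.start_with?("InfoKey: ")
      key = line.sub("InfoKey: ", "")
    elsif line.start_with?("InfoValue: ") && key
      info[key] = line.sub("InfoValue: ", "")
      key = nil
    end
    # Any other line (for example NumberOfPages) is ignored.
  end
  info
end

# A shortened copy of the dump shown above.
dump = <<~DUMP
  InfoKey: Title
  InfoValue: out.pdf
  InfoKey: CreationDate
  InfoValue: D:20071021080307
  NumberOfPages: 1
DUMP

info = parse_info(dump)
puts info["Title"]         # the Title descriptor
puts info["CreationDate"]  # the raw pdf date string
```

In practice you would replace the inline text with File.read("outd.txt").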

Now, let's go into a text editor and add values to the outd.txt file. We add four lines, just above the final NumberOfPages line, that describe the copyright and the usage for the image.


InfoKey: Title
InfoValue: out.pdf
InfoKey: Producer
InfoValue: ImageMagick 6.2.9 08/11/06 Q8 http://www.imagemagick.org
InfoKey: ModDate
InfoValue: D:20071021080307
InfoKey: CreationDate
InfoValue: D:20071021080307
InfoKey: Copyright
InfoValue: Copyright (C) 2007 Jules J. Berman
InfoKey: Usage
InfoValue: GNU Free Documentation License
NumberOfPages: 1
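
The hand edit can also be scripted. This is only a sketch, assuming the dump ends with a NumberOfPages line as it does here; add_descriptors is a hypothetical helper name of my own, not something pdftk provides.

```ruby
# Insert extra InfoKey/InfoValue pairs ahead of the NumberOfPages line.
# add_descriptors is a hypothetical helper, not part of pdftk.
def add_descriptors(dump_text, extra)
  new_lines = extra.flat_map { |k, v| ["InfoKey: #{k}", "InfoValue: #{v}"] }
  lines = dump_text.lines.map(&:chomp)
  # Fall back to appending at the end if there is no NumberOfPages line.
  idx = lines.index { |l| l.start_with?("NumberOfPages:") } || lines.length
  lines.insert(idx, *new_lines)
  lines.join("\n") + "\n"
end

# A shortened copy of the original dump.
dump = "InfoKey: Title\nInfoValue: out.pdf\nNumberOfPages: 1\n"

extra = {
  "Copyright" => "Copyright (C) 2007 Jules J. Berman",
  "Usage"     => "GNU Free Documentation License"
}

revised = add_descriptors(dump, extra)
puts revised
```

To use it on the real file, read outd.txt with File.read and write the result back with File.write.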


Now, let's write this revised descriptor file back into the pdf image.


C:\pdftk>pdftk out.pdf update_info outd.txt output outd.pdf
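
If you are already working in Ruby, the same command can be launched from a script. Here's a rough sketch, assuming pdftk is on your path and the files sit in the current directory; update_info_command is a hypothetical helper name, and the pdftk arguments are exactly the ones on the command line above.

```ruby
# Build the pdftk argument list for an update_info run.
# update_info_command is a hypothetical helper name.
def update_info_command(input_pdf, info_file, output_pdf)
  ["pdftk", input_pdf, "update_info", info_file, "output", output_pdf]
end

cmd = update_info_command("out.pdf", "outd.txt", "outd.pdf")
puts cmd.join(" ")

# Run the command only when the input files are actually present.
# system returns true on success, false on a non-zero exit status,
# and nil if pdftk could not be started at all.
if File.exist?("out.pdf") && File.exist?("outd.txt")
  ok = system(*cmd)
  warn "pdftk did not complete successfully" unless ok
end
```

Passing the arguments to system as separate strings, rather than as one long string, avoids quoting trouble with file names that contain spaces.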


We now have a new pdf file, outd.pdf, with the revised descriptor list included in the pdf file header. Let's check. Write a dump_data command line to extract the text descriptors of the outd.pdf file.


C:\pdftk>pdftk outd.pdf dump_data output outd2.txt


Then print out the outd2.txt file, which contains the text descriptors for outd.pdf.


C:\pdftk>type outd2.txt
InfoKey: Title
InfoValue: out.pdf
InfoKey: Producer
InfoValue: ImageMagick 6.2.9 08/11/06 Q8 http://www.imagemagick.org
InfoKey: Copyright
InfoValue: Copyright (C) 2007 Jules J. Berman
InfoKey: ModDate
InfoValue: D:20071021080307
InfoKey: Usage
InfoValue: GNU Free Documentation License
InfoKey: CreationDate
InfoValue: D:20071021080307
NumberOfPages: 1


Notice that the InfoKey/InfoValue pairs have been rearranged, but the information that we added (the copyright statement and the usage statement) is included.
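
Because the pairs come back in a different order, comparing the two text files line by line would report a difference even though nothing was lost. Here's a small sketch of an order-free check; info_pairs is a hypothetical helper of my own, and it assumes every InfoKey line is followed by its InfoValue line, as in the listings above.

```ruby
# Collect the InfoKey/InfoValue pairs from dump_data text into a Hash.
# info_pairs is a hypothetical helper, not part of pdftk.
def info_pairs(dump_text)
  keys   = dump_text.scan(/^InfoKey: (.*)$/).flatten
  values = dump_text.scan(/^InfoValue: (.*)$/).flatten
  keys.zip(values).to_h
end

# Shortened stand-ins for the edited file and the second dump.
edited = <<~DUMP
  InfoKey: Title
  InfoValue: out.pdf
  InfoKey: Copyright
  InfoValue: Copyright (C) 2007 Jules J. Berman
  InfoKey: Usage
  InfoValue: GNU Free Documentation License
  NumberOfPages: 1
DUMP

dumped = <<~DUMP
  InfoKey: Copyright
  InfoValue: Copyright (C) 2007 Jules J. Berman
  InfoKey: Usage
  InfoValue: GNU Free Documentation License
  InfoKey: Title
  InfoValue: out.pdf
  NumberOfPages: 1
DUMP

# Hash equality in Ruby ignores the order of the pairs.
same = info_pairs(edited) == info_pairs(dumped)
puts same ? "descriptors match" : "descriptors differ"
```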

So, how do you preserve header information when you transform a png or a jpg file to a pdf file? I'm sure you can figure this out for yourself, but just for completeness' sake, I hope to write some future posts that provide techniques and examples.

- Jules Berman

