Wednesday, October 24, 2007

Merging and splitting pdf images while preserving image annotations

A nice feature of pdf files is the ability to merge files or even parts of files. In the past week, we have discussed the free command-line utility, pdftk. The pdftk utility manages pdf files and is available at:

http://www.accesspdf.com/pdftk

The pdftk utility permits you to easily merge collections of pdf files. If you have converted your images to pdf files (something easily done with Ruby and discussed in several recent blog posts), you can make a large pdf file with each page of the file corresponding to a different image. If you are a pathologist who wants to send a set of images for consultation to another pathologist, you can combine all of the pertinent images into a single pdf file and send it to a colleague.

The message stressed in almost every entry of this Specified Life blog is that data has no value unless it is fully described. Furthermore, the description of the data must be bound to the data object. There are lots of ways of achieving this goal. In the case of pdf images, descriptive data can be added to the image header or descriptive data can be added to the image as an attachment file.

What happens to the attachment files for pdf images when you combine images into a merged document? Do you lose your attachments?

No. The descriptive attachments are preserved and can be reclaimed from the merged file. There are actually several different ways of reclaiming attached files from a merged pdf document. In this blog, we'll describe just one method, using pdftk.

First, we'll prepare two pdf image files to which textual attachments have been added (a technique discussed in a prior post). We'll put these into the c:\pdftk subdirectory, where we keep our pdftk exe files.

10/23/2007 08:00 PM 30,101 out_att.pdf
10/24/2007 07:07 AM 125,845 submu_at.pdf
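
For readers who missed the earlier post, here is a minimal Python sketch of how such an attachment might be added with pdftk's attach_files operation. It only builds the command line. The input name submu.pdf is hypothetical, and attaching to page 1 (rather than to the document as a whole) is our assumption, not something stated in the post.

```python
import subprocess


def attach_command(pdf_in, attachment, pdf_out, page=None):
    """Build a pdftk command line that attaches a file to a pdf."""
    cmd = ["pdftk", pdf_in, "attach_files", attachment]
    if page is not None:
        # Attach to a single page instead of the whole document.
        cmd += ["to_page", str(page)]
    cmd += ["output", pdf_out]
    return cmd


def run(cmd):
    """Run a pdftk command; requires pdftk on the PATH."""
    subprocess.run(cmd, check=True)


if __name__ == "__main__":
    # submu.pdf is a hypothetical name for the image before attachment.
    print(" ".join(attach_command("submu.pdf", "submu_at.txt",
                                  "submu_at.pdf", page=1)))
```

Calling run() on the returned list would perform the attachment; printing it first is a cheap way to check the command before touching any files.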

We'll combine all of our pdf image files in the subdirectory using pdftk's cat command.

C:\pdftk>pdftk *.pdf cat output combined.pdf

This produces a multipage file, combined.pdf, with each image occupying a page of the file.
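
The same merge can be driven from a script. The sketch below is one possible way to do it in Python, assuming pdftk is on the PATH. It lists the pdf files itself instead of passing a wildcard, and it leaves the output file out of the inputs so that a second run does not fold an older combined.pdf into the new one. The function names are ours, chosen for illustration.

```python
import glob
import subprocess


def cat_command(inputs, output="combined.pdf"):
    """Build the pdftk command that merges `inputs` into `output`."""
    # Skip the output file itself and sort so page order is predictable.
    sources = sorted(p for p in inputs if p.lower() != output.lower())
    if not sources:
        raise ValueError("no pdf files to merge")
    return ["pdftk", *sources, "cat", "output", output]


def merge_pdfs(output="combined.pdf"):
    """Merge every pdf in the current directory into `output`."""
    cmd = cat_command(glob.glob("*.pdf"), output)
    subprocess.run(cmd, check=True)  # requires pdftk on the PATH
    return cmd
```

With the two files listed above, cat_command puts out_att.pdf before submu_at.pdf, so the first page of the merged file is the out_att image.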

Next, we'll split the file, combined.pdf, into pages, with each page becoming a new pdf file that contains a single image (using pdftk's burst command).

C:\pdftk>pdftk combined.pdf burst

Now, when we look at all of the .pdf files in our subdirectory, we see five files: the two files that we started with (out_att.pdf and submu_at.pdf), the merged file (combined.pdf), and the two single-page files that the burst command split out of combined.pdf (pg_0001.pdf and pg_0002.pdf).


C:\pdftk>dir *.pdf
Volume in drive C has no label.
Volume Serial Number is 504C-8FB6

Directory of C:\pdftk

10/24/2007 07:09 AM 155,685 combined.pdf
10/23/2007 08:00 PM 30,101 out_att.pdf
10/24/2007 07:11 AM 30,251 pg_0001.pdf
10/24/2007 07:11 AM 125,990 pg_0002.pdf
10/24/2007 07:07 AM 125,845 submu_at.pdf
5 File(s) 467,872 bytes
0 Dir(s) 30,727,413,760 bytes free
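
The burst step can be scripted as well. This is a rough Python sketch, assuming pdftk is on the PATH. It passes the naming pattern explicitly (pg_%04d.pdf matches the names in the listing above) and works out which page files to expect. The helper names are hypothetical. Note that burst also writes a small report file, doc_data.txt, alongside the pages.

```python
import subprocess


def burst_command(pdf_in, pattern="pg_%04d.pdf"):
    """Build the pdftk command that splits `pdf_in` into one pdf per page."""
    return ["pdftk", pdf_in, "burst", "output", pattern]


def burst_page_names(page_count, pattern="pg_%04d.pdf"):
    """List the file names that burst should produce for `page_count` pages."""
    return [pattern % n for n in range(1, page_count + 1)]


def burst(pdf_in, pattern="pg_%04d.pdf"):
    """Run the burst; requires pdftk on the PATH."""
    subprocess.run(burst_command(pdf_in, pattern), check=True)
```

For our two-page combined.pdf, burst_page_names(2) gives pg_0001.pdf and pg_0002.pdf, the same names that appear in the directory listing.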


Can we reclaim the attachments for these pdf figures? Yes.

We can retrieve the attachment file by using pdftk's unpack_files command on any of the burst files.

Here, we extract the attachment from pg_0002.pdf.

C:\pdftk>pdftk pg_0002.pdf unpack_files

The new file produced by unpacking the attachment is submu_at.txt, the textual attachment originally added to the file submu_at.pdf.


C:\pdftk>dir *.txt
Volume in drive C has no label.
Volume Serial Number is 504C-8FB6

Directory of C:\pdftk
10/24/2007 07:18 AM 190 SUBMU_AT.TXT


We can read the contents of submu_at.txt with the DOS type command.


C:\pdftk>type submu_at.txt
This is the descriptive text that I have attached
to an image of submucosa from Gray's Anatomy, and
copied from the Wikipedia Commons at:
http://en.wikipedia.org/wiki/Image:Gray1033.png
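
Unpacking and reading the attachment can be combined in a few lines of Python. The sketch below is only one way to do it, and the function names are ours. It builds the unpack_files command, optionally with an output directory, and then reads back any .txt files it finds, ignoring case in the extension because our attachment came out as SUBMU_AT.TXT.

```python
import subprocess
from pathlib import Path


def unpack_command(pdf_in, out_dir=None):
    """Build the pdftk command that unpacks attachments from `pdf_in`."""
    cmd = ["pdftk", str(pdf_in), "unpack_files"]
    if out_dir is not None:
        # Without an output directory, pdftk unpacks into the current one.
        cmd += ["output", str(out_dir)]
    return cmd


def read_attachments(out_dir="."):
    """Return {file name: text} for every .txt file in `out_dir`."""
    found = {}
    for path in sorted(Path(out_dir).iterdir()):
        if path.is_file() and path.suffix.lower() == ".txt":
            found[path.name] = path.read_text(encoding="utf-8",
                                              errors="replace")
    return found


def unpack(pdf_in, out_dir="."):
    """Unpack and read the attachments; requires pdftk on the PATH."""
    subprocess.run(unpack_command(pdf_in, out_dir), check=True)
    return read_attachments(out_dir)
```

Unpacking each page into its own directory keeps attachments from different images apart, which matters once there are more than a handful of pages.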


This seems like a lot of work, and it is. Now that we know the individual steps for creating pdf image attachment files, merging pdf images, reclaiming the attachments, and displaying the image and its description, we can write scripts (in Perl, Python, or Ruby) that automate all of these steps (by making system calls to pdftk) and that scale up to more ambitious efforts.
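
As a starting point, here is a hedged Python sketch of such a script. It assumes one page per source file, as in our example, and that pdftk is on the PATH. The function names and the att_0001-style directory names are invented for illustration. By default it only prints the commands, so the plan can be inspected before anything is run.

```python
import os
import subprocess


def roundtrip_plan(sources, combined="combined.pdf", pattern="pg_%04d.pdf"):
    """Return pdftk commands for merge, burst, and per-page unpack.

    Assumes every source pdf holds exactly one page.
    """
    plan = [
        ["pdftk", *sources, "cat", "output", combined],
        ["pdftk", combined, "burst", "output", pattern],
    ]
    for n in range(1, len(sources) + 1):
        # One attachment directory per page (hypothetical naming scheme).
        plan.append(["pdftk", pattern % n, "unpack_files",
                     "output", "att_%04d" % n])
    return plan


def run_plan(plan, dry_run=True):
    """Print each command and, unless dry_run is set, execute it."""
    for cmd in plan:
        print(" ".join(cmd))
        if dry_run:
            continue
        if cmd[2] == "unpack_files":
            # Make sure the output directory exists before unpacking.
            os.makedirs(cmd[-1], exist_ok=True)
        subprocess.run(cmd, check=True)


if __name__ == "__main__":
    run_plan(roundtrip_plan(["out_att.pdf", "submu_at.pdf"]))
```

Running the script as written prints four commands: one merge, one burst, and one unpack for each of the two pages. Passing dry_run=False to run_plan executes them.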

Jules J. Berman
