PDF_Blog

Mark Gavin's PDFblog

This blog contains items I find interesting or useful primarily related to Portable Document Format (PDF). 

PDF/UA is now ISO/AWI 14289

The PDF Universal Accessibility working group has received an ISO "Approved Work Item" number from the International Organization for Standardization (ISO).  PDF/UA is now ISO/AWI 14289.


As PDF/UA continues to move through the standards process, the eventual standard will be labeled ISO 14289.

PDF Language Module

We have developed a BBEdit Language Module for PDF.  This language module is written to aid developers and support personnel in reading raw PDF files by highlighting specific elements of the file with syntax coloring.



The meanings of the colors are as follows:

blue - PDF keywords

red - arrays

green - dictionaries

purple - strings

light gray - stream data


PDFHilight is a Universal Binary plug-in built for BBEdit 9.  

Download PDFHilight


Installation: To install the plug-in, make sure BBEdit isn't running, and place the plug-in in the following folder:

(your home directory)/Library/Application Support/BBEdit/Language Modules


If the "Language Modules" directory doesn't exist on your machine, you should create it.  PDFHilight will automatically add itself to the "Languages" pane of the BBEdit Preferences.


Usage: Once the plug-in is installed, PDF files opened in BBEdit will be displayed with syntax coloring to help locate certain elements in the structure.


Note: BBEdit defaults to displaying text files with "Translate Line Breaks" enabled.  If you want to count bytes in a PDF file, you should disable "Translate Line Breaks" in the "Text Files" preferences pane.


Disclaimer: PDFHilight is provided unsupported and "as-is", though we welcome your comments and suggestions.

Object Streams

Recently I have received several PDF documents which contain compressed object streams.  Object Streams are described in section 7.5.7 of ISO 32000-1 and are a mechanism for storing a collection of indirect Cos Objects together inside a Cos Stream.

This Cos Stream, of Type "ObjStm", may or may not be compressed, though it would be pointless not to compress the stream.

Object Streams became available starting with PDF 1.5, and therein lies the reason for this blog posting.  Every one of the files with Object Streams to cross my desk recently has a version number of "PDF 1.4".

The purpose of the PDF Version Number is to specify what PDF features may be present in a given PDF file.  

When a PDF file contains Object Streams, the PDF version number must be set to %PDF-1.5 or greater; otherwise, the PDF is malformed.
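A rough way to spot this mismatch is to compare the header version against the presence of an ObjStm stream.  The sketch below is a plain byte scan, not a real PDF parser, and the function names are hypothetical.  It also ignores the /Version entry in the document catalog, which (from PDF 1.4 on) can override the header, so a hit should be treated as a reason to look closer rather than as proof.

```python
import re

def header_version(data: bytes):
    """Return the (major, minor) version from the %PDF-x.y header, or None."""
    match = re.search(rb"%PDF-(\d)\.(\d)", data[:1024])
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))

def has_object_streams(data: bytes) -> bool:
    """Heuristic: look for a /Type /ObjStm entry anywhere in the raw bytes."""
    return re.search(rb"/Type\s*/ObjStm\b", data) is not None

def version_mismatch(data: bytes) -> bool:
    """True when object streams appear in a file whose header is below 1.5."""
    version = header_version(data)
    return version is not None and version < (1, 5) and has_object_streams(data)
```

To try it on a file, read the file in binary mode and pass the bytes to version_mismatch.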

So, why is properly setting the PDF version number important?

In a nutshell, properly setting the PDF version number improves the performance and reliability of software used to read the PDF file.

Update: December 22, 2008

Today I received yet another malformed PDF file which contains compressed object streams; in this case, the version number of the file is set to 1.3.

It appears there are a significant number of PDF developers who are simply not reading the PDF Reference.

PDF Standard Available

The unofficial ISO 32000-1 PDF Standard document is now available as a free download from Adobe. The body text of the document is the same as the official ISO 32000-1 standard, but the page headers and footers have been changed to replace the ISO copyright with the Adobe copyright.

ISO 32000-1

Portable Document Format (PDF) is now officially an international standard. The International Organization for Standardization (ISO) has published ISO 32000-1:2008, Document management – Portable document format – Part 1: PDF 1.7.  This is an ISO standard based on PDF 1.7.  Here is a link to the ISO press release.

Work is currently underway on the development of ISO 32000-2.  To participate in the development of the next version of PDF, get involved.  In the United States, the PDF Reference Committee is managed by AIIM.

Tools for Creating Acrobat Forms

Otherwise known as AcroForms, Acrobat form technology was first introduced in PDF version 1.2 and has been around for more than ten years.  In addition to Adobe Acrobat, there are third parties which have released products to create Acrobat forms.

Following is a list of tools to create AcroForms:

Adobe Acrobat Professional

Nuance PDF Converter Professional Versions 4 or 5

FoxIt Reader Form Designer

Amgraf OneForm Designer Plus

The Acrobat Professional package includes tools to create documents using two different forms technologies: AcroForms, using the form tools under Acrobat, and XFA, using Adobe Form Designer.  Note: XFA is an XML-based forms technology which is incompatible with AcroForms.

PDF Converter Professional includes a standard set of form layout tools very similar to Acrobat.  Nuance has built an excellent tool for automatically laying out form fields on scanned forms.  The software can look at a scanned image of a form, locate and place the form fields automatically, and assign reasonable names to the newly created form fields.  I was surprised by how well the automatic layout tool worked.

FoxIt Software has recently created a plug-in to their FoxIt Reader product with a set of form layout tools similar to Acrobat.

OneForm Designer Plus is the professional forms layout tool used by the IRS to create the US tax forms. 

Free Software

With the opening of the new Appligent Online Store, Appligent has released two more free products: APSaveAs and APConductor.

APSaveAs is a tool for cleaning up PDF files.  Its primary function is to perform a garbage collected save on a PDF file.  In addition, it will also correct many types of malformed and corrupt PDF files. 

APConductor is a stand-alone SOAP server.  It can be used with Appligent applications; in addition, it can be used to turn any CLI-based application into a SOAP web service.

The two other free tools available are APStripFiles and APGetPageCount.

APStripFiles will remove embedded files from a PDF document.  Embedded files within PDF documents can contain viruses or other malicious code.  APStripFiles can be used stand-alone or in conjunction with an email server to remove embedded files from PDF files which are attached to incoming email.

APGetPageCount returns the number of pages within a PDF document.  It can also be used to get the total page count in all PDF files within a directory.

Also available on the Appligent web site is the Java source code for the APFDFGenerator, an FDF generator that makes it easier for Java programmers to dynamically create FDF files.

Forms Data Format

A Forms Data Format (FDF) file is a text file that contains a list of form field names and their values. Acrobat Forms, or AcroForms, were introduced in PDF Version 1.2.  To allow for the import and export of data from AcroForms, Adobe developed the Forms Data Format.  The documentation for the Forms Data Format is located in the PDF Reference in the chapter on "Interactive Features" under the section "Interactive Forms".

There are two kinds of FDF files:

• Classic - supplies data to fill out an existing static form.

• Template - directs the construction of a new PDF document based on the templates found inside specified PDF files, and supplies the data to fill out the form(s) in the new document. This construction or assembly process is sometimes referred to as "spawning" new PDF pages.

Important features of an FDF file:

• An FDF file must begin with %FDF and end with %%EOF.

• The data is given as name-value pairs, also called key-value pairs:

Title - ( /T ) indicates form field name (e.g., Address 2, Name, Date)

Value - ( /V ) indicates form field value (e.g., 29 Communications Road, Suzie Smith, 24 January 2000)

• The pair is enclosed in double angle brackets: << >>

A basic FDF data file is shown below. Within it are nine name-value pairs. The lines before and after these pairs are identification and formatting information.

%FDF-1.2

%âãÏÓ

1 0 obj

<< /FDF

<< /Fields

[

<< /V (Communications Co.)/T (Address1)>>

<< /V (29 Communications Road)/T (Address2)>>

<< /V (Busyville)/T (City)>>

<< /V (USA)/T (Country)>>

<< /V (24 January 2000)/T (Date)>>

<< /V (Suzie Smith)/T (Name)>>

<< /V (\(807\) 221-9999)/T (PhoneNumber)>>

<< /V (777-11-8888)/T (SSN)>>

<< /V (NJ)/T (State)>>

]

/F (TestForm.pdf)>>

>>

endobj

trailer

<<

/Root 1 0 R

>>

%%EOF

The FDF file specification structures FDF files in terms of objects. Objects are enclosed in double angle brackets: << >>. The objects of particular importance are as follows:

• Dictionary objects—collections of key/value pairs. Values can be any kind of object, including an array or another dictionary. The dictionary contents are enclosed in double angle brackets: <</dictionary << /key /value >> >>

• Array objects—collections of other objects, including dictionaries and other arrays. The content of the array object is enclosed in square brackets: <</array [contents]>>

For example, in the basic FDF file displayed above, the /Fields array contains nine dictionary objects.
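To make the structure concrete, here is a minimal sketch of how a Classic FDF file like the one above could be generated from a set of name-value pairs.  The function names are hypothetical; the sketch handles plain ASCII text only, escapes just the backslash and parenthesis characters, and leaves out the binary comment line that follows the header in the sample.

```python
def escape_pdf_string(text: str) -> str:
    """Backslash-escape the characters that are special inside a ( ) string."""
    return (text.replace("\\", "\\\\")
                .replace("(", "\\(")
                .replace(")", "\\)"))

def build_fdf(fields: dict, pdf_name: str) -> str:
    """Build a Classic FDF file from a mapping of field name -> field value."""
    lines = ["%FDF-1.2", "1 0 obj", "<< /FDF", "<< /Fields", "["]
    for name, value in fields.items():
        # One dictionary per field: /V holds the value, /T holds the name.
        lines.append(f"<< /V ({escape_pdf_string(value)})"
                     f"/T ({escape_pdf_string(name)})>>")
    lines += ["]", f"/F ({escape_pdf_string(pdf_name)})>>", ">>", "endobj",
              "trailer", "<<", "/Root 1 0 R", ">>", "%%EOF"]
    return "\n".join(lines) + "\n"
```

Calling build_fdf({"Name": "Suzie Smith", "PhoneNumber": "(807) 221-9999"}, "TestForm.pdf") produces the same kind of /Fields array shown in the sample, with the parentheses in the phone number escaped.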

Presenting Data and Information

Last week while in Boston for the AIIM Conference, I used the Monday before the conference to attend a one-day course taught by Edward Tufte on "Presenting Data and Information".  The course focuses on effectively presenting and communicating information.

The course is given in various locations around the country throughout the year.  I've known about the course for the past several years, but until last week the scheduling didn't work out to make it convenient for me to attend.

I found the course to be well researched, thought provoking and entertaining.  I would recommend it highly.

Following is a quote from Edward Tufte:

"Clutter and confusion are not attributes of information they are failures of design."

Acrobat 8 Crash FreeText Annotation

The following simple PDF document contains a single FreeText annotation.  The FreeText annotation is displayed correctly under both Acrobat 7 and 8.  However, using the mouse to click on the annotation under Acrobat 8 causes Acrobat 8 to crash.  FreeTextCrash.pdf

Below is a screen shot of the FreeText annotation labeled "COV".

freetext_screeshot

The crash occurs in Acrobat 8 on both Windows and Macintosh.

Clicking on the annotation under Acrobat 7 selects the annotation as expected.

HP Smart Document Scan

Recently we have received a couple of malformed PDF files produced by the HP Smart Document Scan software.  It appears that the HP Smart Document Scan software is only included with the HP Scanjet 7800 Scanner and the  HP Scanjet 8350 & 8380 scanners.

The version number of the PDF files produced is PDF 1.0. The first problem we found is located in /Name objects which contain '#xx' hex values.  The use of # hex values was not part of PDF 1.0; # hex values in Name objects were introduced in PDF 1.2.

The second problem is fairly nasty.  Apparently, within the HP Smart Document Scan software the programmer is using the path to the image file as the Name of the image XObject when creating the PDF file.  On some systems; a file path can be quite long.  For long path names; the HP software simply truncates the path at 127 characters.  If the path contains a space character it is placed in the Name object as '#20'.  

The PDF 1.2 Reference states "Any character except null (<00>) may be included in a name by writing its two-character hex code, preceded by #."  It is permissible for hex strings contained within angle brackets (<>) to have an odd number of hex characters, but a # hex code within a name object must always be exactly two hex digits.

Because the XObject name is built from a file path, and the path is truncated at 127 characters, the resulting PDF file can sometimes be malformed.  Following is an example of what is sometimes produced:
 
/T:\Everyone\OXB#20TEST\FilePath\ ... ... \08\Feb#2020,#2

I have removed the center of the name object for clarity.  The name object above was truncated between two hex digits.  At the end of the name you can see '#2'; I'm assuming this should be '#20'. Because this is now an invalid hex character at the end of a name object; it produces the error "non-hex character in a hex string".

This is very easy to locate within the HP files by simply opening the file using a text editor.
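The same check can be automated.  The sketch below, with hypothetical names, flags any '#' in a name object that is not followed by two hex digits, which is exactly the condition the truncated name above violates.

```python
import re

# A '#' that is not followed by two hex digits is a malformed escape.
BAD_ESCAPE = re.compile(r"#(?![0-9A-Fa-f]{2})")

def bad_escape_offsets(name: str) -> list:
    """Return the offsets of malformed '#' escapes in a PDF name object."""
    return [match.start() for match in BAD_ESCAPE.finditer(name)]
```

An empty list means every '#' in the name is followed by a complete two-digit hex code.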

When creating names for XObjects, it is common practice to use a short unique string followed by an incremental counter.  This results in each XObject in the file having a unique name.

Using the file path as the XObject name is a bad practice, especially where very long path names can be truncated.  This practice does not guarantee that all XObjects in the file will have unique names.
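As an illustration of the short-prefix-plus-counter approach, here is a minimal sketch.  The class name, the "Im" prefix, and the choice to reuse a name when the same path is seen twice are all assumptions made for the example, not a description of any particular product.

```python
import itertools

class XObjectNamer:
    """Hand out short, unique XObject names such as /Im1, /Im2, ..."""

    def __init__(self, prefix: str = "Im"):
        self._prefix = prefix
        self._counter = itertools.count(1)
        self._by_path = {}

    def name_for(self, image_path: str) -> str:
        """Return the name for image_path, reusing it if seen before."""
        if image_path not in self._by_path:
            self._by_path[image_path] = f"/{self._prefix}{next(self._counter)}"
        return self._by_path[image_path]
```

The file path is only used as a lookup key inside the program; it never appears in the PDF, so its length and any spaces in it no longer matter.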

To avoid problems with the PDF files created by this software, end users working with the HP Smart Document Scan software should ensure the paths to the scanned image files are far less than 127 characters.

PDF Linearization

Linearization

Linearization is a variant on the PDF file layout as described previously.  Linearization is also called "Fast Web View".  Linearization shuffles the contents of the PDF file to place all of the information needed to display the first page near the beginning of the file.

pastedgraphic-5_textmedium

This allows the user to see the first page while the remainder of the file is still downloading from the web.  

Incremental saves on a linearized file can actually break linearization, but Acrobat still reports the file as enabled for "Fast Web View".  Before publishing to the web, make sure to do a "Save As" with linearization.

In addition, when creating a linearized PDF file, set the Document Properties - Initial View to Page Only.  Displaying Bookmarks, Thumbnails, etc. forces all of the extra information needed to display the sidebar to be sent down to the client along with the data for the first page.
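One quick way to detect a file whose linearization has been broken by an incremental save is to compare the /L entry of the linearization dictionary, which records the total file length, with the actual file size.  The sketch below is a heuristic byte scan with a hypothetical function name; it only looks at the first 1024 bytes and does not validate the hint tables or anything else about the linearized layout.

```python
import re

def linearization_intact(data: bytes) -> bool:
    """Heuristic check that a file still matches its linearization dictionary."""
    head = data[:1024]
    if re.search(rb"/Linearized\b", head) is None:
        return False  # no linearization dictionary near the start of the file
    length = re.search(rb"/L\s+(\d+)", head)
    # /L holds the file length at the time the file was linearized, so any
    # data appended by a later incremental save makes the two values differ.
    return length is not None and int(length.group(1)) == len(data)
```

A False result on a file that Acrobat reports as "Fast Web View" is a good hint that a "Save As" is needed before publishing.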

Developer Notes About Linearization

A correctly Linearized PDF file is fairly complex to write.  One of the basic problems is that Linearization is not properly documented in the Adobe PDF Reference.

Another problem is that Linearization is difficult to test.  We at Appligent had to develop our own test web server to analyze the packet requests from Acrobat.


Jim King's Presentations

Jim King, Principal Scientist at Adobe Systems, has a personal web site which contains a collection of his public presentations.  These presentations include PDF Tutorials, Color Management, Color Science, XML/PDF Tutorial and High Resolution Rendering.  Several of the presentations are annotated with speaker's notes.  I would encourage everyone to check it out.  The URL to the presentations is as follows: http://home.comcast.net/~jk05/presentations/

ISO-PDF

The first meeting of the Portable Document Format (PDF) Reference Committee will be held in Silver Spring, MD on July 16 and 17, 2007.  The meeting location and agenda can be found on the AIIM web site using the above link.  In addition, the same web page also contains a link to the draft of the document submitted to AIIM by Adobe.

The official name of the proposed standard is expected to be ISO 32000.  The draft document submitted to AIIM by Adobe is 768 pages.  The PDF Reference 1.7 is 1310 pages.  According to Adobe the reduced number of pages is the result of using the ISO standard A4 paper size and removing some Adobe specific information.  

The draft document also contains additional sections not found in the current PDF Reference and changes to sections that Adobe "considered incomplete". 

The first page of the draft document is watermarked "FAST-TRACK PROCEDURE".  The ISO fast track procedure is designed to ease the approval of existing standards that have been created by other standards bodies.  Since neither the PDF Reference nor the new draft document has ever been formally recognized as a standard by any other standards body, I do not believe the fast track procedure should be applied in this case.

I have been sitting on standards committees for the past five years.  The amount of time needed to develop the majority of standards is typically three to five years.   The PDF/A standard is an anomaly in being adopted in just over two years.  The PDF/A standard is 39 pages.

PDF Basic File Layout

A Typical PDF File

For the most part, the basic layout of a PDF file can be fairly simple.  A PDF file consists of four primary sections as illustrated below:

image

The PDF file "Header" is just one or two lines starting with %PDF.  The "Body" is a collection of objects which include the page contents, fonts, annotations, etc.  The "xref Table", or cross reference table, is a collection of pointers to locate the individual objects contained in the "Body".  The "Trailer" contains the pointer to the start of the cross reference table.
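The four sections can be located with very little code.  The following sketch, with a hypothetical function name, reads the version from the header and follows the startxref pointer at the end of the file to the cross reference table.  It assumes the header sits at the very start of the file and does not parse the body or the table entries.

```python
import re

def describe_layout(data: bytes) -> dict:
    """Pull the header version and the last startxref offset out of a PDF."""
    header = re.match(rb"%PDF-(\d\.\d)", data)
    # The file ends with: startxref, the byte offset of the xref, then %%EOF.
    pointers = re.findall(rb"startxref\s+(\d+)\s+%%EOF", data)
    if header is None or not pointers:
        raise ValueError("not a recognizable PDF file")
    xref_offset = int(pointers[-1])
    return {
        "version": header.group(1).decode("ascii"),
        "xref_offset": xref_offset,
        # A classic cross reference table starts with the keyword "xref".
        "xref_is_table": data[xref_offset:xref_offset + 4] == b"xref",
    }
```

If "xref_is_table" comes back False on an otherwise valid file, the offset most likely points at a cross reference stream instead of a classic table.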

Incremental Saves

Starting with the basic layout above, PDF supports the concept of incremental saves.  This is the ability to make modifications to the file without altering the actual content of the original saved document.

image

There are several advantages to incremental saves.

1. Saving the file to disk is quicker because you are only tacking the new data to the end of an existing file.

2. An incrementally saved document contains an audit trail of changes to the PDF file.  This allows the file to be "rolled back" to a previous save.

3. The incremental save mechanism is also used to support multiple digital signatures on a single PDF file.

There is also a significant disadvantage to the incremental save mechanism.  Selecting "Save" under the Acrobat file menu automatically does an incremental save.  When PDF documents are edited (for example, when the user adds form fields or comments), the document is typically "saved" multiple times.  This leads to file size increase, because the unused or obsolete data remains in the PDF file.

To remove the unused data in an incrementally saved PDF file an Acrobat user needs to perform a "Save As...". We have seen cases where a 200 KB PDF file increased in size to over 2.5 MB due to incremental saves. In these cases, a simple "Save As" can result in dramatic file size reductions.
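Because each incremental save appends a new section that ends with its own %%EOF marker, counting those markers gives a rough idea of how many times a file has been saved.  The sketch below uses hypothetical function names and is only a heuristic: a linearized file contains more than one %%EOF even without incremental saves, and the marker could in principle appear inside stream data.

```python
def count_saved_revisions(data: bytes) -> int:
    """Count %%EOF markers as a rough proxy for the number of saves."""
    return data.count(b"%%EOF")

def bytes_after_first_revision(data: bytes) -> int:
    """Number of bytes appended after the first %%EOF marker."""
    first_end = data.find(b"%%EOF")
    if first_end < 0:
        return 0
    return len(data) - (first_end + len(b"%%EOF"))
```

A high revision count together with a large number of appended bytes suggests the file is a good candidate for a "Save As".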

Adobe Bates Numbering?

We received an email from one of our customers, an attorney who uses Bates numbering on a regular basis.  Following is one of the sentences from this customer's email:

"I wouldn't have thought it possible, but Adobe has managed to implement its Bates-stamping in a manner which makes it virtually useless [or at least highly impractical for use by] attorneys, the primary users of Bates-stamp utilities." 

When I saw this I decided to take a look at Acrobat Bates Numbering. 

I really don't use most of the features available in Acrobat, this being no exception.  But, I do know how Bates numbering works because Appligent was the first to implement Bates numbering of PDF documents in our StampPDF product line back in 1997.

I started with a simple 20-page test document and placed a basic Bates number sequence in the top left corner.  That seemed to work, but I did notice that Acrobat would automatically overwrite the original PDF documents without giving me a warning that I was about to alter the originals.  Unless I missed it, I did not see an option to save the files to a new directory.

Next, I tried left/right page numbering (recto/verso format).  This is very useful when working with bound documents.  The odd page header and the even page header are on opposite sides of the page.  For example, the odd page header will be on the right side and the even page header will be on the left side.

The first problem was figuring out how to place numbers differently on odd and even pages.  There is a popup allowing you to select odd or even page ranges, but it appears to only let you do one or the other.  This would force the user to Bates number the document twice.  So, I gave it a try for odd pages.

Acrobat Bates numbering of odd pages, starting with the number 1, results in 000001 being placed on the first page, just as it should be.  The second Bates number, 000002, is then placed on page 3; 000003 is placed on page 5; 000004 is placed on page 7; and so on.  Acrobat literally skips over the even pages and places the wrong Bates numbers on all of the pages except the first.
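For comparison, here is a minimal sketch of the behavior one would expect from recto/verso numbering: the Bates number always follows the page number, and only the placement changes between odd and even pages.  The function name and its parameters are hypothetical and do not describe how any product implements this.

```python
def bates_plan(page_count: int, start: int = 1, digits: int = 6):
    """Yield (page_number, bates_label, side) for every page, in order."""
    for page_number in range(1, page_count + 1):
        # The label advances on every page, odd or even.
        label = str(start + page_number - 1).zfill(digits)
        # Odd (recto) pages carry the header on the right, even on the left.
        side = "right" if page_number % 2 else "left"
        yield page_number, label, side
```

With this plan, page 3 receives 000003 on the right side, rather than 000002.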

My first recommendation with regard to Acrobat Bates numbering is to duplicate the files to be numbered so you don't overwrite the original PDF documents.

My second recommendation is that anyone using Acrobat Bates numbering should very carefully verify the results before submitting documents to a court.

See Also: Acrobat XML Tags for Bates Numbering

Acrobat 8 Text Shifting

Following is a collection of screen shots taken using a single PDF file displayed under Acrobat 4 through Acrobat 8.

Acrobat 4

Acrobat_4

Acrobat 5

Acrobat_5

Acrobat 6

Acrobat_6

Acrobat 7

Acrobat_7

Acrobat 8

Acrobat_8

Following is a PDF file which demonstrates the text shifting problem:

Acrobat_8_Text_Shift.pdf

This particular drawing error is caused by passing a large negative character spacing in a text array when the text is of zero length. 

132.96 741.6 TD -0.06048 Tc [()-4800()] TJ -0.32976 Tc (A) Tj

Since Adobe has never documented proper coding constructs, the above has become a common technique used by some third-party developers to move to the beginning of a line.

The PDF Reference states "Strings presented to the text-showing operators may be any length...".  Since the PDF Reference does not state that text cannot be zero length, the above PDF construct is correct.  Unfortunately, it looks like someone in the Acrobat 8 engineering team "didn't get the memo".

My original title for this entry was "When is a validation tool, not a validation tool?"  But that looked a bit too long and the reference a bit obscure. 

PDF does not have a concept of "Well Formed and Valid" like XML.  In addition, Adobe has never created a PDF validation tool to help PDF developers determine if the PDF they produce is correct.  By default, Adobe Reader is used by developers as a PDF validation tool.  Unfortunately, Adobe Reader is not, and never was intended to be, a validation tool.

With glaring errors like this in Acrobat 8, the whole concept of PDF as "digital paper" begins to unravel.

PDF - The Missing References

The Adobe PDF Reference is similar to the Adobe PostScript Language Reference in that both can be compared to a dictionary.  A dictionary is a document which contains all of the words that can be used in a language, but it doesn't teach you how to combine those words into a good, well-structured book.

PDF is based on PostScript.  The documentation for PostScript was released as a set of three volumes.

PostScript Language Reference - Red Book

PostScript Language Program Design - Green Book

PostScript Language Tutorial and Cookbook - Blue Book

These documents are typically referred to by the color name because the covers of the books are actually red, green and blue.

The PDF Reference is similar to the PostScript Red Book.  Unlike with PostScript, Adobe has not released a PDF Blue Book or a PDF Green Book to help teach developers how to construct well-formed PDF files.

So why is PostScript important to PDF?  Much of PDF is built on top of PostScript, and since the Blue Book and the Green Book do not exist for PDF, reading and understanding the PostScript documentation is the next best information source available to PDF developers.

The above links can be found on a web site for the "Collider Detector at Fermilab".  The lab has an excellent page on PostScript.

PDF Version Numbers

I find that there is a general misunderstanding about the nature of Portable Document Format (PDF) version numbers.

Version 1.0 of the PDF file format was released by Adobe in 1993. Over the past fourteen years PDF has been updated seven times.  The current version of PDF is 1.7. These changes to the PDF version number represent additions to the file format.

All of the "older" stuff in PDF works exactly the same way it did.  None of the basic PDF text drawing primitives have changed.  PDF 1.0 is still perfectly valid and usable; it is the basis for all PDF files in existence today.  It simply cannot be used to represent the more advanced features or graphics found in later versions of PDF.

pdf_versions

Following is a list of new document features which are available in the various versions of PDF.

 
PDF Version - Updates

1.1 - Document Encryption (40 bit), Article Threads, Named Destinations, Link Actions and Device Independent Color Resources

1.2 - Form Fields, Halftone Screens and other advanced color features, and support for Chinese, Japanese and Korean text

1.3 - Digital Signatures, Logical Structure, JavaScript, Embedded Files, Masked Images, Smooth Shading, Support for additional color spaces and CID fonts

1.4 - Document Encryption (128 bit), Tagged PDF, Accessibility Support, Transparency, Metadata Streams

1.5 - Document Encryption (Public Key), JPEG 2000 Compression, Optional Content Groups, Additional Annotation Types

1.6 - Document Encryption (AES), Increase Maximum Page Size, Incorporate 3D Artwork, Additional Annotation Types

1.7 - Some additional annotation support for 3D and engineering.

So, if you only need basic text on a page and 40 bit encryption, you only need PDF 1.1.  If you want to use JPEG 2000 compression, you need PDF 1.5.

Taking a PDF 1.4 file and "updating" it to PDF 1.7 really has no meaning because the PDF 1.4 document doesn't contain any PDF 1.7 additions.
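The idea that the version number simply follows from the features in use can be expressed as a small lookup.  The sketch below covers only a handful of features from the list above; the feature keys and the function name are made up for the example.

```python
# Minimum PDF version for a few features, taken from the list above.
FEATURE_VERSION = {
    "encryption_40bit": (1, 1),
    "form_fields": (1, 2),
    "digital_signatures": (1, 3),
    "transparency": (1, 4),
    "jpeg2000": (1, 5),
    "encryption_aes": (1, 6),
}

def required_version(features) -> str:
    """Return the lowest PDF version able to carry all the given features."""
    major, minor = max((FEATURE_VERSION[f] for f in features), default=(1, 0))
    return f"{major}.{minor}"
```

A document that uses none of the listed features comes out as PDF 1.0, and adding a single newer feature raises the result to the version that introduced it.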

The majority of PDF files in common use today are typically PDF 1.4.

Acrobat XML Tags for Bates Numbering

Adobe has released a technical note talking about additional XML data Acrobat 8 adds to each page of a PDF file when the file is Bates numbered using Acrobat 8.

Bates Numbering in PDF documents (PDF, 123K)

Here is what the XML looks like:

  <Bates start="1" ndigits="6" prefix="ADBE" suffix="DRAFT"/>

The above XML is added to each page of the PDF file and will produce a Bates number on each page, for example, ADBE000001DRAFT.

So, instead of simply correctly numbering each and every page, applications that attempt to use this information will need to calculate the Bates number based on the above XML attributes.  Easy enough, except that lawyers tend to split documents apart and append them back together in different ways.  So, in some cases, this mechanism will be worse than useless, because the wrong number will be returned.

Since Acrobat is placing the same XML data on each and every page, it would have been so easy to simply add another attribute with the actual Bates number for the given page.
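Here is a minimal sketch of the calculation an application would have to perform.  It assumes the number for a page is the start attribute plus the page's zero-based position in the file, which is my reading of the attributes and not something confirmed by the technical note; the function name is hypothetical.

```python
import xml.etree.ElementTree as ET

def bates_label(xml_text: str, page_index: int) -> str:
    """Compute a Bates label from the <Bates> attributes and a page index."""
    attrs = ET.fromstring(xml_text).attrib
    # Assumption: the number is "start" plus the zero-based page position.
    number = int(attrs["start"]) + page_index
    digits = int(attrs["ndigits"])
    return f"{attrs.get('prefix', '')}{number:0{digits}d}{attrs.get('suffix', '')}"
```

Since the result depends on where the page currently sits in the file, moving the page to a different document or position changes the computed label, which is exactly the problem described above.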

bates2

The above image depicts the classic method of Bates Numbering documents.  The original can be found at the Early Office Museum: Office Photos ~ 1920s

Copyright 2009 by Appligent, Inc.