PDF Cross Reference Table

Aug 26, 2010 12:00:00 AM | Linearization PDF Cross Reference Table

The PDF Cross Reference Table (xref) is the third major section of a PDF File. It is the index by which all the indirect objects in the PDF file are located.

by Mark Gavin

The PDF Cross Reference Table (xref) is the third major section of a PDF file.  Please refer to the PDF Basic File Layout.  The xref is the index by which all of the indirect objects in the PDF file are located.  A single PDF file can contain multiple xref tables if the file has been incrementally saved or linearized.

Typically, the PDF cross reference table will have the following form.

[Figure: PDF Cross Reference Table]

The cross reference table starts with the word “xref”.  In the above example, all of the data following the word “xref” is an “xref subsection”.  An xref can contain more than one xref subsection.

The first line of the xref subsection contains two numbers.  The first number is the object number of the first object in this xref subsection.  The second number is the count of objects in this xref subsection; the entries that follow describe consecutively numbered objects.
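
The header line can be read with very little code. The following is a minimal sketch, not a full PDF parser; the function name `parse_subsection_header` and the sample header `0 6` are hypothetical and chosen only for illustration.

```python
def parse_subsection_header(line: bytes) -> tuple[int, int]:
    """Return (first_object_number, entry_count) from a subsection header line.

    Hypothetical helper for illustration; real files need more validation.
    """
    parts = line.split()  # splits on any whitespace and drops the line ending
    if len(parts) != 2:
        raise ValueError("expected exactly two numbers in the subsection header")
    first_object, count = int(parts[0]), int(parts[1])
    if first_object < 0 or count < 0:
        raise ValueError("object number and count must be non-negative")
    return first_object, count


# Illustrative header: the subsection starts at object 0 and holds 6 entries.
first_object, count = parse_subsection_header(b"0 6\r\n")

# The entries that follow describe objects first_object .. first_object + count - 1.
object_numbers = list(range(first_object, first_object + count))
print(object_numbers)
```

With this sample header, the subsection covers object numbers 0 through 5.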

The remainder of the data in the xref subsection contains a sequence of lines which represent three types of data associated with each PDF indirect object as follows:

1. The location of the object, specified as the byte offset to the object from the beginning of the PDF file.  (For a free entry, this field instead holds the object number of the next free object.)
2. The generation number of the object.
3. A flag defining if this specific object is in use or free.

Each line of the second portion of the xref subsection MUST be exactly 20 bytes long, including the line ending characters.
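
Because every entry is exactly 20 bytes, the entry for a given object can be found by simple arithmetic rather than by scanning line by line. The sketch below assumes the usual fixed layout of a 10-digit offset, a space, a 5-digit generation number, a space, the flag `n` (in use) or `f` (free), and a 2-byte line ending. The names `XrefEntry` and `parse_entries` and the sample offsets are hypothetical.

```python
from typing import NamedTuple

ENTRY_LENGTH = 20  # fixed size of every entry, including the 2-byte line ending


class XrefEntry(NamedTuple):
    object_number: int
    offset: int       # byte offset if in use; next free object number if free
    generation: int
    in_use: bool


def parse_entries(data: bytes, first_object: int, count: int) -> list[XrefEntry]:
    """Parse `count` fixed-width entries that follow a subsection header."""
    if len(data) < count * ENTRY_LENGTH:
        raise ValueError("subsection data is shorter than count * 20 bytes")
    entries = []
    for i in range(count):
        # Fixed width means entry i always starts at byte i * 20.
        raw = data[i * ENTRY_LENGTH:(i + 1) * ENTRY_LENGTH]
        offset = int(raw[0:10])       # 10-digit, zero-padded
        generation = int(raw[11:16])  # 5-digit, zero-padded
        flag = raw[17:18]
        if flag not in (b"n", b"f"):
            raise ValueError(f"unexpected flag {flag!r} in entry {i}")
        entries.append(XrefEntry(first_object + i, offset, generation, flag == b"n"))
    return entries


# Illustrative data only; the offsets are made up for this sketch.
sample = (
    b"0000000000 65535 f \n"
    b"0000000017 00000 n \n"
    b"0000000081 00000 n \n"
)
entries = parse_entries(sample, first_object=0, count=3)
for entry in entries:
    print(entry)
```

The fixed width is a design choice that favors random access: a reader that wants only one object can seek straight to its entry.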

PDF 1.5 introduced an optional new form of xref: a cross-reference stream rather than a cross-reference table.  A PDF 1.5 file may contain one or both for backward compatibility.  The advantages of the new form are reduced file size and support for documents larger than 10 GBytes, a limit that follows from the 10-digit byte offset in a classic table entry.  More information can be found in ISO-32000 Section 7.5.8.  Following is an example of a cross-reference stream.

stream
01 0E8A 0 % Entry for object 2 (0x0E8A = 3722)
02 0002 00 % Entry for object 3 (in object stream 2, index 0)
02 0002 01 % Entry for object 4 (in object stream 2, index 1)
02 0002 02 % …
02 0002 03
02 0002 04
02 0002 05
02 0002 06
02 0002 07 % Entry for object 10 (in object stream 2, index 7)
01 1323 0 % Entry for object 11 (0x1323 = 4899)
endstream
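
The entries above can be decoded mechanically once the field widths are known. The sketch below assumes widths of 1, 2, and 1 bytes and a first object number of 2; the widths are an assumption (the excerpt does not show the stream dictionary), and the first object number is taken from the comments in the example. The function name `decode_xref_stream` is hypothetical. Type 1 entries carry a byte offset and generation number, type 2 entries carry the number of the containing object stream and an index within it, and type 0 entries are free.

```python
def decode_xref_stream(data: bytes, widths: tuple[int, int, int], first_object: int) -> dict:
    """Decode already-unfiltered cross-reference stream data into a dict.

    Sketch only: zero-width fields, which take default values in the
    specification, are not handled here.
    """
    row_size = sum(widths)
    if row_size == 0 or len(data) % row_size != 0:
        raise ValueError("data length is not a multiple of the row size")
    entries = {}
    for row_index in range(len(data) // row_size):
        row = data[row_index * row_size:(row_index + 1) * row_size]
        fields, pos = [], 0
        for width in widths:
            # Each field is a big-endian unsigned integer.
            fields.append(int.from_bytes(row[pos:pos + width], "big"))
            pos += width
        kind, second, third = fields
        obj = first_object + row_index
        if kind == 0:
            entries[obj] = ("free", second, third)              # next free object, generation
        elif kind == 1:
            entries[obj] = ("offset", second, third)            # byte offset, generation
        elif kind == 2:
            entries[obj] = ("in_object_stream", second, third)  # stream object number, index
        else:
            raise ValueError(f"unknown entry type {kind}")
    return entries


# The ten rows from the example, written out as full hex bytes.
rows = ["01 0E8A 00"] + [f"02 0002 {i:02X}" for i in range(8)] + ["01 1323 00"]
data = bytes.fromhex(" ".join(rows))

result = decode_xref_stream(data, widths=(1, 2, 1), first_object=2)
for obj, entry in result.items():
    print(obj, entry)
```

Under these assumptions, object 2 resolves to byte offset 3722, objects 3 through 10 resolve to indexes 0 through 7 of object stream 2, and object 11 resolves to byte offset 4899, matching the comments in the example.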

Written By: Mark Gavin

Appligent Chief Technology Officer and software architect. Mark invented PDF redaction in 1997 and is also the creator of several other first-ever PDF applications, including Appligent’s SecurSign and FDFMerge, EMC’s Documentum IRM for PDF, and Liquent’s CoreDossier.