by Mark Gavin
Below is a minimal PDF document which displays the text “Hello World”. This PDF document is formatted for readability. The header, xref and trailer have been omitted.
Start at the bottom at Object 6, the “Catalog” dictionary, and work up through the object numbers. Object 6 contains a reference to the “Pages” dictionary located at Object 5. In the Pages dictionary, “Kids” is an array of references to “Page” dictionaries; “Count” is the number of document pages associated with this “Pages” dictionary; “MediaBox” is the size of the paper in points. At 72 points per inch, the 612 by 792 point MediaBox used here is US Letter, 8.5 by 11 inches.
    1 0 obj
    <</Type/Page /Parent 5 0 R /Resources 3 0 R /Contents 2 0 R >>
    endobj

    2 0 obj
    <</Length 51>>
    stream
    BT
    /F1 48 Tf
    1 0 0 1 210 400 Tm
    (Hello World)Tj
    ET
    endstream
    endobj

    3 0 obj
    <</ProcSet[/PDF/Text] /Font<</F1 4 0 R>> >>
    endobj

    4 0 obj
    << /Type/Font /Subtype/Type1 /Name/F1 /BaseFont/Helvetica >>
    endobj

    5 0 obj
    <</Type/Pages /Kids[ 1 0 R] /Count 1 /MediaBox[ 0 0 612 792 ]>>
    endobj

    6 0 obj
    <</Type/Catalog /Pages 5 0 R>>
    endobj
Object 1 is the “Page” dictionary and contains references to the page's “Resources” and the page's “Contents”. The page resources primarily include the fonts and images to be displayed on the page. Within a PDF file, fonts are given unique names; in this case the unique font name is “F1”.
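The chain of references described above can be followed mechanically. The sketch below is only an illustration, not a PDF parser: the object bodies are copied by hand into a hypothetical `OBJECTS` dictionary, the helper names `ref` and `walk` are made up for this example, and the regular expressions only cope with the simple layout of this particular listing.

```python
import re

# Object bodies copied by hand from the listing (object 2, the stream, is omitted).
# A real reader would locate objects through the xref table instead.
OBJECTS = {
    1: "<</Type/Page /Parent 5 0 R /Resources 3 0 R /Contents 2 0 R >>",
    3: "<</ProcSet[/PDF/Text] /Font<</F1 4 0 R>> >>",
    4: "<< /Type/Font /Subtype/Type1 /Name/F1 /BaseFont/Helvetica >>",
    5: "<</Type/Pages /Kids[ 1 0 R] /Count 1 /MediaBox[ 0 0 612 792 ]>>",
    6: "<</Type/Catalog /Pages 5 0 R>>",
}

def ref(body, key):
    """Return the object number that '/key N G R' points to in a dictionary body."""
    match = re.search(r"/" + key + r"\s*(\d+)\s+\d+\s+R", body)
    if match is None:
        raise KeyError(key)
    return int(match.group(1))

def walk(objects, root=6):
    """Follow Catalog -> Pages -> Kids -> Page -> Resources -> Font."""
    pages = ref(objects[root], "Pages")
    kids_src = re.search(r"/Kids\s*\[(.*?)\]", objects[pages]).group(1)
    kids = [int(n) for n in re.findall(r"(\d+)\s+\d+\s+R", kids_src)]
    result = []
    for page in kids:
        resources = ref(objects[page], "Resources")
        font = ref(objects[resources], "F1")  # "F1" is the font name used on this page
        base = re.search(r"/BaseFont\s*/(\w+)", objects[font]).group(1)
        result.append({
            "page": page,
            "contents": ref(objects[page], "Contents"),
            "resources": resources,
            "font": font,
            "base_font": base,
        })
    return result

for entry in walk(OBJECTS):
    print(entry)
```

Starting from Object 6, the walk reaches Object 5, then the single page at Object 1, and from there the content stream at Object 2 and the Helvetica font at Object 4.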
The page content is a stream of PDF drawing operators.
    BT
    /F1 48 Tf
    1 0 0 1 210 400 Tm
    (Hello World)Tj
    ET
The following table contains a very small subset of the drawing operators available in PDF.
Operators | Meaning | Example |
---|---|---|
BT | Begin Text | |
ET | End Text | |
Tf | Text Font & Size | /F1 24 Tf |
Tm | Text Matrix | 1 0 0 1 260 600 Tm |
Tj | Show Text | (Hello World)Tj |
The internals of a PDF content stream use postfix notation, just like PostScript and Forth. The values used by an operator are pushed onto a stack, and the operator then pops the values it needs. For example, to set the font and size in the page content stream above, first the unique name of the font is pushed onto the stack, next the font size is pushed onto the stack, and then the Tf operator pops both values off the stack and uses them to set the font and font size.
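The stack behaviour can be made concrete with a toy evaluator. This is a minimal sketch that assumes only the five operators from the table; the names `TOKEN`, `ARITY` and `run` are invented for the example, and the tokenizer ignores details such as nested parentheses in strings, arrays and hex strings.

```python
import re

# Tokens: (string) | /Name | number | operator keyword
TOKEN = re.compile(r"\((?:[^()\\]|\\.)*\)|/[^\s/()\[\]<>]+|[-+]?\d*\.?\d+|[A-Za-z*]+")

# How many operands each operator pops from the stack.
ARITY = {"BT": 0, "ET": 0, "Tf": 2, "Tm": 6, "Tj": 1}

def run(stream):
    """Push operands onto a stack; let each operator pop the operands it needs."""
    stack, trace = [], []
    for tok in TOKEN.findall(stream):
        if tok in ARITY:
            n = ARITY[tok]
            args = stack[len(stack) - n:]   # the top n values, in the order pushed
            del stack[len(stack) - n:]
            trace.append((tok, args))
        elif tok[0] in "(/":
            stack.append(tok)               # string or name operand
        else:
            stack.append(float(tok))        # ValueError for operators outside the table
    return trace

content = "BT /F1 48 Tf 1 0 0 1 210 400 Tm (Hello World)Tj ET"
for operator, operands in run(content):
    print(operator, operands)
```

Each printed line shows one operator together with the operands it popped, so `Tf` appears with the font name `/F1` and the size 48.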
The Tm (Text Matrix) operator not only positions the text on the page but also sets the text scale, rotation and skew. See ISO 32000, Section 8.3.3.
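As a rough illustration of what the six Tm operands do, the sketch below applies a matrix `a b c d e f` to a point using the PDF convention x' = a·x + c·y + e and y' = b·x + d·y + f. The helper names `apply` and `rotation_at` are made up for this example, and the rotated and scaled matrices are illustrative values that do not appear in the document above.

```python
import math

def apply(m, x, y):
    """Transform the point (x, y) by the matrix operands a b c d e f."""
    a, b, c, d, e, f = m
    return (a * x + c * y + e, b * x + d * y + f)

def rotation_at(theta_deg, tx, ty):
    """Operands for text rotated counterclockwise by theta_deg and placed at (tx, ty)."""
    t = math.radians(theta_deg)
    return (math.cos(t), math.sin(t), -math.sin(t), math.cos(t), tx, ty)

hello = (1, 0, 0, 1, 210, 400)     # the matrix from the Hello World page
doubled = (2, 0, 0, 2, 210, 400)   # same position, text scaled by 2 (illustrative)

print(apply(hello, 0, 0))                      # where the text origin lands on the page
print(apply(doubled, 10, 0))                   # a point 10 units along the baseline
print(apply(rotation_at(90, 210, 400), 10, 0)) # the same point with the text rotated 90 degrees
```

With the identity values `1 0 0 1` in the first four operands, the matrix only translates, which is why the Hello World text starts at x = 210, y = 400.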
The Tj (Show Text) operator is one of a few operators used to specify the character glyphs to be drawn to the page.
It is important to remember that PDF is a binary file format. It may look like text, but it is not, and no one should be under the illusion that it can be edited like text. Stream lengths and the object offsets in the xref table are stored as exact byte counts, so editing a PDF file as if it were a text file will corrupt the PDF file.
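A small experiment shows how a text-style edit goes wrong. This is a sketch under stated assumptions: the content stream is written with one operator per line and LF line endings (which gives exactly the 51 bytes declared in Object 2), and the helpers `stream_object`, `declared_length`, `actual_length` and `offsets` are hypothetical names for this example only.

```python
import re

# Content stream with LF line endings; its byte length is 51.
CONTENT = b"BT\n/F1 48 Tf\n1 0 0 1 210 400 Tm\n(Hello World)Tj\nET\n"

def stream_object(num, data):
    """Serialize a stream object whose /Length matches the data exactly."""
    head = b"%d 0 obj\n<</Length %d>>\nstream\n" % (num, len(data))
    return head + data + b"endstream\nendobj\n"

def declared_length(obj):
    """The byte count the object claims in its /Length entry."""
    return int(re.search(rb"/Length (\d+)", obj).group(1))

def actual_length(obj):
    """The number of bytes really present between 'stream' and 'endstream'."""
    start = obj.index(b"stream\n") + len(b"stream\n")
    return obj.index(b"endstream") - start

def offsets(objects):
    """Byte offset of each object when written back to back, as an xref table records them."""
    result, position = [], 0
    for obj in objects:
        result.append(position)
        position += len(obj)
    return result

obj2 = stream_object(2, CONTENT)
obj3 = b"3 0 obj\n<</ProcSet[/PDF/Text] /Font<</F1 4 0 R>> >>\nendobj\n"

# A naive "text editor" change that adds two bytes without touching /Length.
edited = obj2.replace(b"Hello World", b"Hello, World!")

print(declared_length(obj2), actual_length(obj2))
print(declared_length(edited), actual_length(edited))
print(offsets([obj2, obj3]), offsets([edited, obj3]))
```

After the edit, the declared length no longer matches the bytes in the stream, and every object that follows has moved, so any offsets recorded before the edit now point at the wrong bytes.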