| <!-- |
| ! Licensed to the Apache Software Foundation (ASF) under one or more |
| ! contributor license agreements. See the NOTICE file distributed with |
| ! this work for additional information regarding copyright ownership. |
| ! The ASF licenses this file to You under the Apache License, Version 2.0 |
| ! (the "License"); you may not use this file except in compliance with |
| ! the License. You may obtain a copy of the License at |
| ! |
| ! http://www.apache.org/licenses/LICENSE-2.0 |
| ! |
| ! Unless required by applicable law or agreed to in writing, software |
| ! distributed under the License is distributed on an "AS IS" BASIS, |
| ! WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. |
| ! See the License for the specific language governing permissions and |
| ! limitations under the License. |
| !--> |
| <!DOCTYPE document PUBLIC "-//APACHE//DTD Documentation V2.0//EN" "http://forrest.apache.org/dtd/document-v20.dtd"> |
| <document> |
| <header> |
| <title>PDFBox - User Guide</title> |
| </header> |
| <body> |
| <section> |
| <title>Introduction</title> |
| <p> |
| This page will discuss the internals of PDF documents |
| and how those internals map to PDFBox classes. |
| Users should reference the javadoc to see what classes and methods are available. The |
| <a href="http://partners.adobe.com/public/developer/pdf/index_reference.html">Adobe PDF Reference</a> |
| can be used to determine detailed information about fields and their meanings. |
| </p> |
| </section> |
| <section> |
| <title>Examples</title> |
| <p>A variety of examples can be found in the src/org/pdfbox/examples folder. This guide will refer to |
| specific examples as needed. |
| </p> |
| </section> |
| <section> |
| <title>PDF File Format Overview</title> |
| <p> |
| A PDF document is a stream of basic object types. The low level objects are represented in PDFBox |
| in the <i>org.pdfbox.cos</i> package. The basic types in a PDF are: |
| </p> |
| <table> |
| <tr> |
| <th>PDF Type</th> |
| <th>Description</th> |
| <th>Example</th> |
| <th>PDFBox class</th> |
| </tr> |
| <tr> |
| <td>Array</td> |
| <td>An ordered list of items</td> |
| <td>[1 2 3]</td> |
| <td>org.pdfbox.cos.COSArray</td> |
| </tr> |
| <tr> |
| <td>Boolean</td> |
| <td>Standard True/False values</td> |
| <td>true</td> |
| <td>org.pdfbox.cos.COSBoolean</td> |
| </tr> |
| <tr> |
| <td>Dictionary</td> |
| <td>A map of name value pairs</td> |
| <td><<<br/> |
| /Type /XObject<br/> |
| /Name (Name)<br/> |
| /Size 1<br/> |
| >> |
| </td> |
| <td>org.pdfbox.cos.COSDictionary</td> |
| </tr> |
| <tr> |
| <td>Number</td> |
| <td>Integer and Floating point numbers</td> |
| <td>1 2.3</td> |
| <td>org.pdfbox.cos.COSFloat<br />org.pdfbox.cos.COSInteger</td> |
| </tr> |
| <tr> |
| <td>Name</td> |
| <td>A predefined value in a PDF document, typically used as a key in a dictionary</td> |
| <td>/Type</td> |
| <td>org.pdfbox.cos.COSName</td> |
| </tr> |
| <tr> |
| <td>Object</td> |
| <td>A wrapper to any of the other objects, this can be used to reference an object multiple times. |
| An object is referenced by using two numbers, an object number and a generation number. Initially |
| the generation number will be zero unless the object got replaced later in the stream. |
| </td> |
| <td>12 0 obj << /Type /XObject >> endobj</td> |
| <td>org.pdfbox.cos.COSObject</td> |
| </tr> |
| <tr> |
| <td>Stream</td> |
| <td>A stream of data, typically compressed. This is used for page contents, images and |
| embedded font streams. |
| </td> |
| <td>12 0 obj << /Type /XObject >> stream 030004040404040404 endstream</td> |
| <td>org.pdfbox.cos.COSStream</td> |
| </tr> |
| <tr> |
| <td>String</td> |
| <td>A sequence of characters |
| </td> |
| <td>(This is a string)</td> |
| <td>org.pdfbox.cos.COSString</td> |
| </tr> |
| </table> |
| <p> |
| A page in a pdf document is represented with a COSDictionary. The entries that are available for |
| a page can be seen in the PDF Reference and an example of a page looks like this: |
| </p> |
| <table> |
| <tr><td> |
| <pre> |
| << |
| /Type /Page |
| /MediaBox [0 0 612 915] |
| /Contents 56 0 R |
| >></pre> |
| </td></tr> |
| </table> |
| |
| <p>Some Java code to access fields</p> |
| |
| <table> |
| <tr><td> |
| <pre>COSDictionary page = ...; |
| COSArray mediaBox = (COSArray)page.getDictionaryObject( "MediaBox" ); |
| System.out.println( "Width:" + mediaBox.get( 3 ) ); |
| </pre> |
| </td></tr> |
| </table> |
| </section> |
| <section> |
| <title>PD Model</title> |
| <p>The COS Model allows access to all aspects of a PDF document. This type of programming is |
| tedious and error prone though because the user must know all of the names of the parameters |
| and no helper methods are available. The PD Model was created to help alleviate this problem. |
| Each type of object(page, font, image) has a set of defined attributes that can be available |
| in the dictionary. A PD Model class is available for each of these so that strongly typed |
| methods are available to access the attributes. The same code from above to get the page width |
| can be rewritten to use PD Model classes. |
| </p> |
| <table> |
| <tr><td> |
| <pre>PDPage page = ...; |
| PDRectangle mediaBox = page.getMediaBox(); |
| System.out.println( "Width:" + mediaBox.getWidth() );</pre> |
| </td></tr> |
| </table> |
| <p>PD Model objects sit on top of COS model. Typically, the classes in the PD Model |
| will only store a COS object and all setter/getter methods will modify data that |
| is stored in the COS object. For example, when you call PDPage.getLastModified() the method |
| will do a lookup in the COSDictionary with the key "LastModified", if it is found the value is |
| then converter to a java.util.Calendar. When PDPage.setLastModified( Calendar ) is called |
| then the Calendar is converted to a string in the COSDictionary. |
| </p> |
| <p>Here is a visual depiction of the COS Model and PD Model design.</p> |
| <center><img src="../images/cos-pdmodel diagram.png" /></center> |
| <p> |
| This design presents many advantages and disadvantages.<br /><br/> |
| <b>Advantages:</b></p> |
| <ul> |
| <li>Simple, easy to use API.</li> |
| <li>Underlying document automatically gets updated when you update the PD Model</li> |
| <li>Ability to easily access the COS Model from any PD Model object</li> |
| <li>Easily add to and update existing PDF documents</li> |
| </ul> |
| <p><b>Disadvantages:</b></p> |
| <ul> |
| <li>Object caching is not done in the PD Model classes |
| <note>For example, each call to PDPage.getMediaBox() will return a new PDRectangle |
| object, but will contain the same underlying COSArray.</note> |
| </li> |
| </ul> |
| </section> |
| </body> |
| </document> |