blob: c8375e08c909ac60456367f42c686dae3cac0222 [file]
<!--
! Licensed to the Apache Software Foundation (ASF) under one or more
! contributor license agreements. See the NOTICE file distributed with
! this work for additional information regarding copyright ownership.
! The ASF licenses this file to You under the Apache License, Version 2.0
! (the "License"); you may not use this file except in compliance with
! the License. You may obtain a copy of the License at
!
! http://www.apache.org/licenses/LICENSE-2.0
!
! Unless required by applicable law or agreed to in writing, software
! distributed under the License is distributed on an "AS IS" BASIS,
! WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
! See the License for the specific language governing permissions and
! limitations under the License.
!-->
<!DOCTYPE document PUBLIC "-//APACHE//DTD Documentation V2.0//EN" "http://forrest.apache.org/dtd/document-v20.dtd">
<document>
<header>
<title>PDFBox - User Guide</title>
</header>
<body>
<section>
<title>Introduction</title>
<p>
This page will discuss the internals of PDF documents
and how those internals map to PDFBox classes.
Users should reference the javadoc to see what classes and methods are available. The
<a href="http://partners.adobe.com/public/developer/pdf/index_reference.html">Adobe PDF Reference</a>
can be used to determine detailed information about fields and their meanings.
</p>
</section>
<section>
<title>Examples</title>
<p>A variety of examples can be found in the src/org/pdfbox/examples folder. This guide will refer to
specific examples as needed.
</p>
</section>
<section>
<title>PDF File Format Overview</title>
<p>
A PDF document is a stream of basic object types. The low level objects are represented in PDFBox
in the <i>org.pdfbox.cos</i> package. The basic types in a PDF are:
</p>
<table>
<tr>
<th>PDF Type</th>
<th>Description</th>
<th>Example</th>
<th>PDFBox class</th>
</tr>
<tr>
<td>Array</td>
<td>An ordered list of items</td>
<td>[1 2 3]</td>
<td>org.pdfbox.cos.COSArray</td>
</tr>
<tr>
<td>Boolean</td>
<td>Standard True/False values</td>
<td>true</td>
<td>org.pdfbox.cos.COSBoolean</td>
</tr>
<tr>
<td>Dictionary</td>
<td>A map of name value pairs</td>
<td>&lt;&lt;<br/>
/Type /XObject<br/>
/Name (Name)<br/>
/Size 1<br/>
&gt;&gt;
</td>
<td>org.pdfbox.cos.COSDictionary</td>
</tr>
<tr>
<td>Number</td>
<td>Integer and Floating point numbers</td>
<td>1 2.3</td>
<td>org.pdfbox.cos.COSFloat<br />org.pdfbox.cos.COSInteger</td>
</tr>
<tr>
<td>Name</td>
<td>A predefined value in a PDF document, typically used as a key in a dictionary</td>
<td>/Type</td>
<td>org.pdfbox.cos.COSName</td>
</tr>
<tr>
<td>Object</td>
<td>A wrapper to any of the other objects, this can be used to reference an object multiple times.
An object is referenced by using two numbers, an object number and a generation number. Initially
the generation number will be zero unless the object got replaced later in the stream.
</td>
<td>12 0 obj &lt;&lt; /Type /XObject &gt;&gt; endobj</td>
<td>org.pdfbox.cos.COSObject</td>
</tr>
<tr>
<td>Stream</td>
<td>A stream of data, typically compressed. This is used for page contents, images and
embedded font streams.
</td>
<td>12 0 obj &lt;&lt; /Type /XObject &gt;&gt; stream 030004040404040404 endstream</td>
<td>org.pdfbox.cos.COSStream</td>
</tr>
<tr>
<td>String</td>
<td>A sequence of characters
</td>
<td>(This is a string)</td>
<td>org.pdfbox.cos.COSString</td>
</tr>
</table>
<p>
A page in a pdf document is represented with a COSDictionary. The entries that are available for
a page can be seen in the PDF Reference and an example of a page looks like this:
</p>
<table>
<tr><td>
<pre>
&lt;&lt;
/Type /Page
/MediaBox [0 0 612 915]
/Contents 56 0 R
&gt;&gt;</pre>
</td></tr>
</table>
<p>Some Java code to access fields</p>
<table>
<tr><td>
<pre>COSDictionary page = ...;
COSArray mediaBox = (COSArray)page.getDictionaryObject( "MediaBox" );
System.out.println( "Width:" + mediaBox.get( 3 ) );
</pre>
</td></tr>
</table>
</section>
<section>
<title>PD Model</title>
<p>The COS Model allows access to all aspects of a PDF document. This type of programming is
tedious and error prone though because the user must know all of the names of the parameters
and no helper methods are available. The PD Model was created to help alleviate this problem.
Each type of object(page, font, image) has a set of defined attributes that can be available
in the dictionary. A PD Model class is available for each of these so that strongly typed
methods are available to access the attributes. The same code from above to get the page width
can be rewritten to use PD Model classes.
</p>
<table>
<tr><td>
<pre>PDPage page = ...;
PDRectangle mediaBox = page.getMediaBox();
System.out.println( "Width:" + mediaBox.getWidth() );</pre>
</td></tr>
</table>
<p>PD Model objects sit on top of COS model. Typically, the classes in the PD Model
will only store a COS object and all setter/getter methods will modify data that
is stored in the COS object. For example, when you call PDPage.getLastModified() the method
will do a lookup in the COSDictionary with the key "LastModified", if it is found the value is
then converter to a java.util.Calendar. When PDPage.setLastModified( Calendar ) is called
then the Calendar is converted to a string in the COSDictionary.
</p>
<p>Here is a visual depiction of the COS Model and PD Model design.</p>
<center><img src="../images/cos-pdmodel diagram.png" /></center>
<p>
This design presents many advantages and disadvantages.<br /><br/>
<b>Advantages:</b></p>
<ul>
<li>Simple, easy to use API.</li>
<li>Underlying document automatically gets updated when you update the PD Model</li>
<li>Ability to easily access the COS Model from any PD Model object</li>
<li>Easily add to and update existing PDF documents</li>
</ul>
<p><b>Disadvantages:</b></p>
<ul>
<li>Object caching is not done in the PD Model classes
<note>For example, each call to PDPage.getMediaBox() will return a new PDRectangle
object, but will contain the same underlying COSArray.</note>
</li>
</ul>
</section>
</body>
</document>