===================================================
Apache PDFBox <http://incubator.apache.org/pdfbox/>
===================================================
PDFBox is an open source Java library for working with PDF documents.
You need Apache Ant <http://ant.apache.org/> to build PDFBox. Once you
have installed Ant, you can build the sources by running "ant" in
this directory.
You can customize the build by adding a "build.properties" file that overrides
the default build properties. For example, the Ant build will create a
Checkstyle report if you have Checkstyle <http://checkstyle.sourceforge.net/>
installed. Set the checkstyle.home.dir property to enable the report:
checkstyle.home.dir=/path/to/checkstyle
The Ant build will build the PDFBox web site if you have Apache Forrest
<http://forrest.apache.org/> installed. Set the FORREST_HOME environment
variable to enable the web site build.
Known Limitations and Problems
==============================
1. You get text like "G38G43G36G51G5" instead of what you expect when you are
extracting text. This is because the characters use a meaningless internal
encoding that points to glyphs embedded in the PDF document. The only way
to access the text is to use OCR. This may be a future enhancement.
2. You get an error message like "java.io.IOException: Can't handle font
width". This MIGHT be because you don't have the Resources directory in
your classpath. The easiest solution is to include the
apache-pdfbox-x.x.x.jar in your classpath.
3. You get text that has the correct characters, but in the wrong
order. This might be because you have not enabled sorting. The text
in PDF files is stored in chunks, and the chunks do not need to be stored
in the order in which they are displayed on a page. By default, PDFBox does
not sort the text. Also, if you have text in a language that reads right to
left (such as Arabic or Hebrew), make sure you have the ICU4J jar file in
your classpath. This library is needed to properly handle right-to-left text.
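The ordering problem in item 3 can be illustrated with a small, self-contained
Java sketch. It does not use PDFBox: the Chunk class and the sortByPosition
method are hypothetical names invented for this illustration, and the sorting
that PDFBox performs is more involved. In PDFBox itself, sorting is normally
enabled by calling setSortByPosition(true) on the PDFTextStripper before
extracting text.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

class ChunkSortSketch {

    /** Hypothetical stand-in for a positioned run of text; not a PDFBox class. */
    static final class Chunk {
        final String text;
        final float x; // horizontal position, grows to the right
        final float y; // vertical position, grows upward as in PDF user space

        Chunk(String text, float x, float y) {
            this.text = text;
            this.x = x;
            this.y = y;
        }
    }

    /** Concatenates the chunk texts in the order given. */
    static String join(List<Chunk> chunks) {
        StringBuilder sb = new StringBuilder();
        for (Chunk c : chunks) {
            sb.append(c.text);
        }
        return sb.toString();
    }

    /**
     * Returns a copy ordered top to bottom, then left to right.
     * Assumption: chunks on the same line share exactly the same y value;
     * real documents would need a tolerance here.
     */
    static List<Chunk> sortByPosition(List<Chunk> chunks) {
        List<Chunk> sorted = new ArrayList<>(chunks);
        sorted.sort(Comparator.comparingDouble((Chunk c) -> c.y)
                .reversed()
                .thenComparingDouble(c -> c.x));
        return sorted;
    }

    public static void main(String[] args) {
        // Storage order differs from the order in which the text is displayed.
        List<Chunk> stored = Arrays.asList(
                new Chunk("again", 50f, 680f),
                new Chunk("world ", 100f, 700f),
                new Chunk("Hello ", 50f, 700f));

        System.out.println(join(stored));
        System.out.println(join(sortByPosition(stored)));
    }
}
```

Reading the chunks in storage order yields scrambled text, while the sorted
copy reads "Hello world again".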
See the issue tracker at https://issues.apache.org/jira/browse/PDFBOX for
the full list of known issues and requested features.
Disclaimer
==========
Apache PDFBox is an effort undergoing incubation at The Apache Software
Foundation (ASF), sponsored by the Apache Incubator PMC. Incubation is
required of all newly accepted projects until a further review indicates
that the infrastructure, communications, and decision making process have
stabilized in a manner consistent with other successful ASF projects. While
incubation status is not necessarily a reflection of the completeness or
stability of the code, it does indicate that the project has yet to be fully
endorsed by the ASF.
See http://incubator.apache.org/projects/pdfbox.html for the current
incubation status of the Apache PDFBox project.
License (see also LICENSE.txt)
==============================
Collective work: Copyright 2009 The Apache Software Foundation.
Licensed to the Apache Software Foundation (ASF) under one or more
contributor license agreements. See the NOTICE file distributed with
this work for additional information regarding copyright ownership.
The ASF licenses this file to You under the Apache License, Version 2.0
(the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
Unmodifiable files
==================
Apache PDFBox contains Adobe CMap and Glyph files that may be redistributed
only in *unmodified* form. See the LICENSE file for the exact licensing
conditions.
Export control
==============
This distribution includes cryptographic software. The country in which
you currently reside may have restrictions on the import, possession, use,
and/or re-export to another country, of encryption software. BEFORE using
any encryption software, please check your country's laws, regulations and
policies concerning the import, possession, or use, and re-export of
encryption software, to see if this is permitted. See
<http://www.wassenaar.org/> for more information.
The U.S. Government Department of Commerce, Bureau of Industry and
Security (BIS), has classified this software as Export Commodity Control
Number (ECCN) 5D002.C.1, which includes information security software using
or performing cryptographic functions with asymmetric algorithms. The form
and manner of this Apache Software Foundation distribution makes it eligible
for export under the License Exception ENC Technology Software Unrestricted
(TSU) exception (see the BIS Export Administration Regulations, Section
740.13) for both object code and source code.
The following provides more details on the included cryptographic software:
Apache PDFBox uses the Java Cryptography Architecture (JCA) and the
Bouncy Castle libraries for handling encryption in PDF documents.