
 <!--
  ! Licensed to the Apache Software Foundation (ASF) under one or more
  ! contributor license agreements.  See the NOTICE file distributed with
  ! this work for additional information regarding copyright ownership.
  ! The ASF licenses this file to You under the Apache License, Version 2.0
  ! (the "License"); you may not use this file except in compliance with
  ! the License.  You may obtain a copy of the License at
  !
  !      http://www.apache.org/licenses/LICENSE-2.0
  !
  ! Unless required by applicable law or agreed to in writing, software
  ! distributed under the License is distributed on an "AS IS" BASIS,
  ! WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
  ! See the License for the specific language governing permissions and
  ! limitations under the License.
  !-->
 <!DOCTYPE faqs PUBLIC "-//APACHE//DTD FAQ V1.2//EN" "http://forrest.apache.org/dtd/faq-v12.dtd" [
 <!ENTITY s '<code>site.xml</code>'>
 ]>

 <faqs title="Frequently Asked Questions">

   <part id="general_questions">
     <title>General Questions</title>
     <faq id="next_version">
       <question>
         When will the next version of PDFBox be released?
       </question>
       <answer>
         <p>
           As fixes are made and integrated into the repository, they are documented in the
           <link href="../changes.html">release notes</link>.  An
           estimate of when the next version will be released
           will also be given. <br /><br />
           This is only an estimate and could change.
         </p>
       </answer>
     </faq>

     <faq id="log4j_config">
       <question>
           I am getting the log4j warning message below. How do I remove it?

       </question>
       <answer>
   <table>
   	<tr><td>
   	log4j:WARN No appenders could be found for logger (org.pdfbox.util.ResourceLoader).<br />
 	log4j:WARN Please initialize the log4j system properly.
 	</td></tr>
   </table>
         <p>
         This message means that you need to configure the log4j logging system.
         See the <link href="http://logging.apache.org/log4j/docs/documentation.html">log4j documentation</link> for more information.
         </p>
         <p>
         PDFBox comes with a sample log4j configuration file.  To use it, set a
         system property like this:
 		</p>
 		<p>

 		java -Dlog4j.configuration=log4j.xml org.pdfbox.ExtractText &lt;PDF-file&gt; &lt;output-text-file&gt;
         </p>
         <p>
         If this does not work, you may have to specify the log4j configuration file using a URL path, like this:<br/>
         <br />
         log4j.configuration=file:///&lt;path to config file&gt;<br />
         <br/>
         Please see <link href="https://sourceforge.net/forum/forum.php?thread_id=1254229&amp;forum_id=267205">this forum thread</link> for more information.

         </p>
       </answer>
     </faq>

     <faq id="pdfbox_threadsafe">
       <question>
       Is PDFBox thread safe?
       </question>
       <answer>
         <p>
         No.  Only one thread may access a given PDDocument at a time.
         Multiple threads can run concurrently as long as each one works on its own PDDocument object.
         </p>
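         <p>
         A minimal sketch of the one-document-per-thread rule follows.  It uses a hypothetical
         StubDocument class as a stand-in for PDDocument so that it runs without PDFBox on the
         classpath; the class and method names are illustrative assumptions, not PDFBox API.
         </p>
         <pre>
```java
public class OneDocumentPerThread {
    // Hypothetical stand-in for a PDDocument; it has no synchronization of its own.
    static class StubDocument {
        private int pages = 0;
        void addPage() { pages++; }
        int getPageCount() { return pages; }
    }

    // Each worker creates and uses its own document; only the final count is shared.
    static int[] countPages(int threads, int pagesPerThread) {
        int[] counts = new int[threads];
        Thread[] workers = new Thread[threads];
        for (int i = 0; i != threads; i++) {
            final int slot = i;
            workers[i] = new Thread(() -> {
                StubDocument doc = new StubDocument(); // never handed to another thread
                for (int p = 0; p != pagesPerThread; p++) {
                    doc.addPage();
                }
                counts[slot] = doc.getPageCount();
            });
            workers[i].start();
        }
        for (Thread worker : workers) {
            try {
                worker.join(); // after join() the worker's write to counts is visible here
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException("interrupted while waiting", e);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        for (int count : countPages(4, 3)) {
            System.out.println("pages in this thread's document: " + count);
        }
    }
}
```
         </pre>
         <p>
         Because no document instance is ever visible to more than one thread, the sketch needs no locking.
         </p>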
       </answer>
     </faq>

     <faq id="pdfbox_close_warning">
       <question>
       Why do I get a "Warning: You did not close the PDF Document"?
       </question>
       <answer>
         <p>
         Call close() on the PDDocument inside a finally block; otherwise the document
         will not be closed properly.  You must also close every
         PDDocument object that gets created.  The following code creates <b>two</b>
         PDDocument objects, one from "new PDDocument()" and a second from the load method,
         but closes only the second, so the first one triggers the warning.
         </p>
         <pre>
     PDDocument doc = new PDDocument();
     try
     {
         doc = PDDocument.load( "my.pdf" );
     }
     finally
     {
         if( doc != null )
         {
             doc.close();
         }
     }
         </pre>
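         <p>
         The effect can be made concrete with a small sketch.  It replaces PDDocument with a
         hypothetical StubDocument class that counts the instances still open, so it runs
         without PDFBox; everything except the shape of the try/finally block is an assumption
         made for illustration.  Starting from null, instead of "new", leaves nothing unclosed.
         </p>
         <pre>
```java
public class CloseEveryDocument {
    // Hypothetical stand-in for PDDocument that counts instances still open.
    static class StubDocument {
        static int open = 0;
        private boolean closed = false;

        StubDocument() { open++; }

        // Mirrors the shape of the load call in the example above.
        static StubDocument load(String fileName) { return new StubDocument(); }

        void close() {
            if (!closed) {
                closed = true;
                open--;
            }
        }
    }

    // Pattern from the example above: the instance from "new" is overwritten and never closed.
    static int leakyPattern() {
        StubDocument.open = 0;
        StubDocument doc = new StubDocument();
        try {
            doc = StubDocument.load("my.pdf");
        } finally {
            if (doc != null) {
                doc.close();
            }
        }
        return StubDocument.open; // number of documents left unclosed
    }

    // Start from null so that load() creates the only instance.
    static int fixedPattern() {
        StubDocument.open = 0;
        StubDocument doc = null;
        try {
            doc = StubDocument.load("my.pdf");
        } finally {
            if (doc != null) {
                doc.close();
            }
        }
        return StubDocument.open;
    }

    public static void main(String[] args) {
        System.out.println("left open by leaky pattern: " + leakyPattern());
        System.out.println("left open by fixed pattern: " + fixedPattern());
    }
}
```
         </pre>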

       </answer>
     </faq>


   </part>

   <part id="text_extraction">
     <title>Text Extraction</title>
     <faq id="no_text_extraction">
       <question>
         How come I am not getting any text from the PDF document?
       </question>
       <answer>
         <p>
           Text extraction from a PDF document is a complicated task, and many factors
           affect whether text can be extracted and how accurately.  It would be helpful
           to the PDFBox team if you could try a couple of things.
         </p>
           <ul>
           	<li>Open the PDF in Acrobat and try to extract text from there.  If Acrobat can extract text
           	then PDFBox should be able to as well, and it is a bug if it cannot.  If Acrobat cannot extract text then
           	PDFBox probably cannot either.</li>
           	<li>It might really be an image instead of text.  Some PDF documents are just images that have
           	been scanned in.  You can tell by using the selection tool in Acrobat: if you can't select
           	any text then it is probably an image.</li>
           </ul>

       </answer>
     </faq>
     <faq id="gibberish_text">
       <question>
         How come I am getting gibberish (G38G43G36G51G5) when extracting text?
       </question>
       <answer>
         <p>
           This is because the characters in a PDF document can use a custom encoding
           instead of Unicode or ASCII.  When you see gibberish text, it
           probably means that a meaningless internal encoding is being used.  In that case the
           only way to access the text is to use OCR.  This may be a future
           enhancement.
         </p>
       </answer>
     </faq>
     <faq id="cant_handle_font_width">
       <question>
         What does "java.io.IOException: Can't handle font width" mean?
       </question>
       <answer>
         <p>
           This probably means that the "Resources" directory is not in your classpath.  The
           Resources directory is included in the PDFBox jar, so this is only a problem if you
           are building PDFBox yourself instead of using the binary.
         </p>
       </answer>
     </faq>
     <faq id="no_permission">
       <question>
         Why do I get "You do not have permission to extract text" on some documents?
       </question>
       <answer>
         <p>
           PDF documents can have security permissions applied to them and have two
           passwords associated with them: a user password and a master password.
           If the "cannot extract text" permission bit is set, you need
           to decrypt the document with the master password in order to extract the text.
         </p>
       </answer>
     </faq>
     <faq id="parse_whole_document">
         <question>Can't we just extract the text without parsing the whole document, or extract text as it is parsed?</question>
         <answer>
         <p>
         Not really, for a few reasons.
         </p>
         <ol>
             <li>If the document is encrypted then you need to parse at least until the encryption dictionary before you can decrypt.</li>
             <li>Sometimes the PDFont contains vital information needed for text extraction.</li>
             <li>Text on a page does not have to be drawn in reading order.  For example, if the page says "Hello World", the PDF could
                 have been written such that "World" is drawn first, then the cursor moves to the left and the word "Hello" is drawn.</li>
         </ol>
         </answer>
     </faq>
   </part>

 </faqs>