blob: bf046258e7f65192e1e14857e28796e00ef761c1 [file] [log] [blame]
<?xml version="1.0" encoding="UTF-8"?>
<!--
====================================================================
Licensed to the Apache Software Foundation (ASF) under one or more
contributor license agreements. See the NOTICE file distributed with
this work for additional information regarding copyright ownership.
The ASF licenses this file to You under the Apache License, Version 2.0
(the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
====================================================================
-->
<!DOCTYPE document PUBLIC "-//APACHE//DTD Documentation V1.1//EN" "../dtd/document-v11.dtd">
<document>
<header>
<title>POI-HWPF - A Quick Guide</title>
<subtitle>Overview</subtitle>
<authors>
<person name="Nick Burch" email="nick at torchbox dot com"/>
</authors>
</header>
<body>
<p>HWPF is still in early development. It is in the <link
href="http://svn.apache.org/viewcvs.cgi/poi/trunk/src/scratchpad/">
scratchpad section of the SVN.</link> You will need to ensure you
either have a recent SVN checkout, or a recent SVN nightly build
(including the scratchpad jar!)</p>
<section><title>Basic Text Extraction</title>
<p>For basic text extraction, make use of
<code>org.apache.poi.hwpf.extractor.WordExtractor</code>. It accepts an input
stream or a <code>HWPFDocument</code>. The <code>getText()</code>
method can be used to
get the text from all the paragraphs, or <code>getParagraphText()</code>
can be used to fetch the text from each paragraph in turn. The other
option is <code>getTextFromPieces()</code>, which is very fast, but
tends to return things that aren't text from the page. YMMV.
</p>
</section>
<section><title>Specific Text Extraction</title>
<p>To get specific bits of text, first create a
<code>org.apache.poi.hwpf.HWPFDocument</code>. Fetch the range
with <code>getRange()</code>, then get paragraphs from that. You
can then get text and other properties.
</p>
</section>
<section><title>Changing Text</title>
<p>It is possible to change the text via
<code>insertBefore()</code> and <code>insertAfter()</code>
on a <code>Range</code> object (either a <code>Range</code>,
<code>Paragraph</code> or <code>CharacterRun</code>).
It is also possible to delete a <code>Range</code>, but this
code is know to have bugs in it.
</p>
</section>
<section><title>Further Examples</title>
<p>For now, the best source of additional examples is in the unit
tests. <link
href="http://svn.apache.org/viewvc/poi/trunk/src/scratchpad/testcases/org/apache/poi/hwpf/">
Browse the HWPF unit tests.</link>
</p>
</section>
</body>
</document>