<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 3.2 Final//EN">
<!--
#**************************************************************
#
# Licensed to the Apache Software Foundation (ASF) under one
# or more contributor license agreements. See the NOTICE file
# distributed with this work for additional information
# regarding copyright ownership. The ASF licenses this file
# to you under the Apache License, Version 2.0 (the
# "License"); you may not use this file except in compliance
# with the License. You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing,
# software distributed under the License is distributed on an
# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
# KIND, either express or implied. See the License for the
# specific language governing permissions and limitations
# under the License.
#
#**************************************************************
-->
<html>
<head>
<title>org.openoffice.xmerge.converter.xml.sxw.aportisdoc package</title>
</head>
<body bgcolor="white">
<p>Provides the tools for doing the conversion of StarWriter XML to
and from AportisDoc format.</p>
<p>It follows the {@link org.openoffice.xmerge} framework for the conversion process.</p>
<p>Since it converts to/from a Palm application format, these converters
follow the <a href="../../../../converter/palm/package-summary.html#streamformat">
<code>PalmDB</code> stream format</a> for writing out to the Palm sync client or
reading in from the Palm sync client.</p>
<p>Note that <code>PluginFactoryImpl</code> also provides a
<code>DocumentMerger</code> object, i.e. {@link org.openoffice.xmerge.converter.xml.sxw.aportisdoc.DocumentMergerImpl DocumentMergerImpl}.
This functionality was derived from its superclass
{@link org.openoffice.xmerge.converter.xml.sxw.SxwPluginFactory
SxwPluginFactory}.</p>
<h2>AportisDoc pdb format - Doc</h2>
<p>The AportisDoc pdb format is widely used by different Palm applications,
e.g. QuickWord, AportisDoc Reader, MiniWrite, etc. Note that some
of these applications put tweaks into the format. The converters will only
support the default AportisDoc format, plus some very minor tweaks to accommodate
other applications.</p>
<p>The text content of the format is plain text, i.e. there are no styles
or structures. There is no notion of lists, list items, paragraphs,
headings, etc. The format does have support for bookmarks.</p>
<p>For most Doc applications, the default character encoding is the
extended ASCII character set, i.e. ISO-8859-1. StarWriter XML is encoded in
UTF-8. Since UTF-8 covers far more characters than ISO-8859-1, converting
UTF-8 strings into extended ASCII can lose characters that have no
ISO-8859-1 mapping.</p>
<p>Using JAXP, XML files can be parsed and read in as Java <code>String</code>s,
which are in Unicode, so there is no loss of character mapping from UTF-8
to Java <code>String</code>s. Loss is possible only when converting Java
<code>String</code>s to extended ASCII bytes. Java characters that
cannot be represented in extended ASCII are converted into the ASCII
character '?' (0x3F) by the <code>String.getBytes(encoding)</code>
API.</p>
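<p>The character loss described above can be illustrated with a minimal sketch.
The class name <code>Latin1LossDemo</code> and its two helper methods are
hypothetical and not part of this package. The sketch also assumes the
<code>Charset</code> overload of <code>String.getBytes</code> rather than the
encoding-name overload mentioned above.</p>

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of the aportisdoc package.
final class Latin1LossDemo {

    /**
     * Encodes text as ISO-8859-1. Characters without an ISO-8859-1
     * mapping are replaced by '?' (0x3F) by String.getBytes.
     */
    static byte[] toLatin1(String text) {
        return text.getBytes(StandardCharsets.ISO_8859_1);
    }

    /** Returns true if at least one character cannot be encoded in ISO-8859-1. */
    static boolean isLossy(String text) {
        return !StandardCharsets.ISO_8859_1.newEncoder().canEncode(text);
    }
}
```

<p>A caller could use <code>isLossy</code> to warn the user before conversion,
since the replacement with '?' cannot be undone when the Doc file is
converted back.</p>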
<h2>SXW to DOC Conversion</h2>
<p>The <code>DocumentSerializerImpl</code> class implements the
<code>org.openoffice.xmerge.DocumentSerializer</code>.
This class specifically provides the conversion process from a given
<code>SxwDocument</code> object to DOC formatted records, which are
then passed back to the client via the <code>ConvertData</code> object.</p>
<p>The following XML tags are handled. [Note that some may not be implemented yet.]</p>
<ul>
<li>
<p>Paragraphs <tt>&lt;text:p&gt;</tt> and Headings <tt>&lt;text:h&gt;</tt></p>
<p>Heading elements are classified the same as paragraph
elements since both have the same possible elements inside.
Their main difference is that they refer to different types
of style information, which is outside of their element tags.
Since there are no styles on the DOC format, headings should
be treated the same way a paragraph is converted.</p>
<p>For paragraph elements, convert and transfer text nodes
that are essential. Text nodes directly contained within paragraph
nodes are such. There are also a number of elements that
a paragraph element may contain. These are explained in their
own context.</p>
<p>At the end of the paragraph, an EOL character is added by
the converter to provide a separation for each paragraph,
since the Doc format does not have a notion of a paragraph.</p>
</li>
<li>
<p>White spaces <tt>&lt;text:s&gt;</tt> and Tabs <tt>&lt;text:tab-stop&gt;</tt></p>
<p>In SXW, normally 2 or more white-space characters are collapsed into
a single space character. In order to make sure that the document
content really contains those white-space characters, there are special
elements assigned to them.</p>
<p>The space element specifies the number of spaces it represents.
Thus, converting it just means writing out the number of spaces
that the element requires.</p>
<p>There is also the tab-stop element. This is a bit tricky. In a
StarWriter document, tab-stops are specified by a column position.
A tab is not an exact number of spaces, but rather a specific column
position. Say regular tab-stops are set at every 5th column.
At column 4, hitting a tab moves the cursor to column 5. At column 1, hitting
a tab puts the cursor at column 5 as well. SmartDoc and AportisDoc
applications also go by columns for the ASCII tab character. The only problem
is that in StarWriter one can specify different tab-stops, but not
in most of these Doc applications; at least I have not seen one that allows it.
The solution is simply to convert to the ASCII tab
character and not do anything for different tab-stop positions.</p>
</li>
<li>
<p>Line breaks <tt>&lt;text:line-break&gt;</tt></p>
<p>To represent line breaks, it is simplest to just write an ASCII LF
character. Note that the end of a paragraph is also written as an
ASCII LF character. Thus, for the DOC to SXW conversion,
line breaks are not distinguishable from the end of a
paragraph.</p>
</li>
<li>
<p>Text spans <tt>&lt;text:span&gt;</tt></p>
<p>Text spans contain text that has style attributes different
from the paragraph's. Text spans can be embedded within another
text span. Since spans exist purely for style tagging, only the
text within them needs to be converted and transferred.</p>
</li>
<li>
<p>Hyperlinks <tt>&lt;text:a&gt;</tt></p>
<p>Convert and transfer the text portion.</p>
</li>
<li>
<p>Bookmarks <tt>&lt;text:bookmark&gt;</tt> <tt>&lt;text:bookmark-start&gt;</tt>
<tt>&lt;text:bookmark-end&gt;</tt> [Not implemented yet]</p>
<p>In SXW, bookmark elements are embedded inside paragraph elements.
Bookmarks can mark either a text position or a text range. <tt>&lt;text:bookmark&gt;</tt>
marks a position, while the pair <tt>&lt;text:bookmark-start&gt;</tt> and
<tt>&lt;text:bookmark-end&gt;</tt> marks a text range. The DOC format only
supports bookmarking a text position. Thus, for the conversion,
<tt>&lt;text:bookmark&gt;</tt> and <tt>&lt;text:bookmark-start&gt;</tt> will both mark
a text position.</p>
</li>
<li>
<p>Change Tracking <tt>&lt;text:tracked-changes&gt;</tt>
<tt>&lt;text:change*&gt;</tt> [Not implemented yet]</p>
<p>Change tracking elements are not yet supported by the current
OpenOffice.org XML filters, so this needs to be watched. The text
within these elements has to be interpreted properly during the
conversion process.</p>
</li>
<li>
<p>Lists <tt>&lt;text:unordered-list&gt;</tt> and
<tt>&lt;text:ordered-list&gt;</tt></p>
<p>A list can only contain one optional <tt>&lt;text:list-header&gt;</tt>
and one or more <tt>&lt;text:list-item&gt;</tt> elements.</p>
<p>A <tt>&lt;text:list-header&gt;</tt> contains one or more paragraph
elements. Since there are no styles, the conversion process does not
do anything special for list headers. Conversion of the paragraphs
within list headers is the same as explained above.</p>
<p>A <tt>&lt;text:list-item&gt;</tt> may contain one or more paragraphs,
headings, lists, etc. Since the Doc format does not support any list
structure, there will not be any special handling for this element.
Conversion of the elements within it is applied according to the
element type. Thus, a list containing paragraphs results in just
plain paragraphs. Sublists will not be identifiable, but paragraphs in
sublists will still appear.</p>
</li>
<li>
<p><tt>&lt;text:section&gt;</tt></p>
<p>It is not yet clear what this element represents; this needs further investigation.</p>
</li>
</ul>
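<p>The handling of paragraphs, headings, spaces, tabs, line breaks, spans,
hyperlinks and lists described above can be sketched as a small DOM walk.
This is only an illustration under stated assumptions, not the code in
<code>DocumentSerializerImpl</code>: the class <code>PlainTextSketch</code> and
its methods are hypothetical, elements are matched by their qualified names
using a parser that is not namespace aware, and the space count is assumed to
be carried in a <tt>text:c</tt> attribute that defaults to 1.</p>

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical sketch, not the actual DocumentSerializerImpl.
final class PlainTextSketch {

    /** Parses an XML string and flattens it into Doc-style plain text. */
    static String toPlainText(String xml) {
        try {
            // The default factory is not namespace aware, so node names
            // keep their "text:" prefix and can be compared as plain strings.
            Element root = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)))
                    .getDocumentElement();
            StringBuilder out = new StringBuilder();
            walkBlock(root, out);
            return out.toString();
        } catch (Exception e) {
            throw new IllegalArgumentException("cannot parse XML", e);
        }
    }

    /** Visits block-level content: paragraphs, headings, and containers such as lists. */
    private static void walkBlock(Node node, StringBuilder out) {
        NodeList children = node.getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            Node child = children.item(i);
            if (child.getNodeType() != Node.ELEMENT_NODE) {
                continue; // text outside paragraphs is not transferred
            }
            String name = child.getNodeName();
            if (name.equals("text:p") || name.equals("text:h")) {
                walkInline(child, out);
                out.append('\n'); // EOL separates paragraphs in the Doc text
            } else {
                walkBlock(child, out); // lists, list items, list headers, ...
            }
        }
    }

    /** Visits the content of a paragraph, heading, span or hyperlink. */
    private static void walkInline(Node node, StringBuilder out) {
        NodeList children = node.getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            Node child = children.item(i);
            if (child.getNodeType() == Node.TEXT_NODE) {
                out.append(child.getNodeValue());
            } else if (child.getNodeType() == Node.ELEMENT_NODE) {
                String name = child.getNodeName();
                if (name.equals("text:s")) {
                    String count = ((Element) child).getAttribute("text:c");
                    int n = count.isEmpty() ? 1 : Integer.parseInt(count);
                    for (int k = 0; k < n; k++) {
                        out.append(' ');
                    }
                } else if (name.equals("text:tab-stop")) {
                    out.append('\t'); // tab-stop positions are ignored
                } else if (name.equals("text:line-break")) {
                    out.append('\n');
                } else {
                    walkInline(child, out); // spans, hyperlinks: keep the text only
                }
            }
        }
    }
}
```

<p>Because both a line break and the end of a paragraph produce an LF, the
output of this sketch shows why the two cannot be told apart when converting
back from DOC to SXW.</p>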
<p>There may be other tags that will still need to be addressed for this conversion.</p>
<p>Refer to {@link org.openoffice.xmerge.converter.xml.sxw.aportisdoc.DocumentSerializerImpl DocumentSerializerImpl}
for details of the implementation. It uses the <code>DocEncoder</code> class to do the encoding
part.</p>
<h2>DOC to SXW Conversion</h2>
<p>The <code>DocumentDeserializerImpl</code> class implements the
<code>org.openoffice.xmerge.DocumentDeserializer</code>. It is
passed the device document in the form of a <code>ConvertData</code> object.
It then creates an <code>SxwDocument</code> object from the conversion of
the DOC formatted records.</p>
<p>The text content of the Doc format will be transferred as text. Paragraph
elements will be formed based on the existence of an ASCII LF character. There
will be at least one paragraph element.</p>
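<p>The paragraph rule above can be sketched as follows. The class
<code>ParagraphSplitSketch</code> is hypothetical and not part of this package.
It assumes that each LF terminates a paragraph and that an empty document still
yields one empty paragraph. How the real converter treats text after the last
LF is not stated here, so keeping it as a final paragraph is an assumption.</p>

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch, not the actual DocumentDeserializerImpl.
final class ParagraphSplitSketch {

    /** Splits decoded Doc text into paragraph strings at each ASCII LF. */
    static List<String> splitParagraphs(String text) {
        List<String> paragraphs = new ArrayList<>();
        int start = 0;
        int lf;
        while ((lf = text.indexOf('\n', start)) >= 0) {
            paragraphs.add(text.substring(start, lf));
            start = lf + 1;
        }
        // Keep trailing text that has no terminating LF, and make sure
        // that even an empty document produces one paragraph.
        if (start < text.length() || paragraphs.isEmpty()) {
            paragraphs.add(text.substring(start));
        }
        return paragraphs;
    }
}
```

<p>Each returned string would then become the text content of one
<tt>&lt;text:p&gt;</tt> element in the resulting document.</p>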
<p>Bookmarks in the Doc format will be converted to the bookmark element
<tt>&lt;text:bookmark&gt;</tt> [Not implemented yet].</p>
<h2>Merging changes</h2>
<p>As mentioned above, the <code>DocumentMerger</code> object produced by
<code>PluginFactoryImpl</code> is <code>DocumentMergerImpl</code>.
Refer to the javadocs of that class for its merging specifications.
</p>
<h2>TODO list</h2>
<ol>
<li>Investigate Palm devices with different character encodings.</li>
<li>Investigate other StarWriter XML tags.</li>
</ol>
</body>
</html>