| <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 3.2//EN"> |
| <HTML> |
| <head> |
| <META HTTP-EQUIV="CONTENT-TYPE" CONTENT="text/html; charset=iso-8859-1"> |
| <TITLE>XML Packages</TITLE> |
| <META NAME="GENERATOR" CONTENT="StarOffice 6.0 (Win32)"> |
| <META NAME="CREATED" CONTENT="20010801;20531125"> |
| <META NAME="CHANGEDBY" CONTENT="Daniel Vogelheim"> |
| <META NAME="CHANGED" CONTENT="20010801;20534064"> |
| </head> |
| <body> |
| <H1>XML Packages</H1> |
| <H2>Purpose of this document</H2> |
| <P>At the end of September, I (<I>dvo</I>) asked the |
| openoffice.org community for help on the problem of embedding binary |
| content in XML documents. This page is intended to summarize the |
| discussion and provide a solid foundation that will guide our |
| implementation efforts. |
| </P> |
| <P>We will first restate the problem. Then follows a summary of the |
| requirements that have been stated in the <A HREF="http://www.openoffice.org/www-discuss/">discuss</A> |
| and <A HREF="http://xml.openoffice.org/xml-dev/">xml-dev</A> mailing |
| lists. The candidate solutions mentioned on the mailing lists are |
| then examined in light of these requirements. Finally, a conclusion |
| is presented. |
| </P> |
| <H3>A few definitions</H3> |
| <P>To clarify the following discussion, a definition of a few terms |
| for use within this document may be in order: |
| </P> |
| <DL> |
| <DT STYLE="margin-bottom: 0.5cm">document |
| </DT><DD STYLE="margin-bottom: 0.5cm"> |
| A document comprises everything a user considers to be part of their |
| text, spreadsheet or presentation. In general this will include |
| several images or OLE objects in addition to the main content. |
| </DD><DD STYLE="margin-bottom: 0.5cm"> |
| If images are linked into the document, they are not considered part |
| of it. Whether an image is linked or embedded into the document is |
| determined by the user. Due to limitations in the current OLE |
| handling, OLE objects must always be embedded. |
| </DD><DT STYLE="margin-bottom: 0.5cm"> |
| file |
| </DT><DD STYLE="margin-bottom: 0.5cm"> |
| A file is a sequence of bytes as stored in the operating system's |
| file system. This fairly obvious definition serves to distinguish |
| between the logical view (the document) and the physical view (the |
| file). |
| </DD><DT STYLE="margin-bottom: 0.5cm"> |
| <EM>b</EM>inary <EM>l</EM>arge <EM>ob</EM>ject, BLOB |
| </DT><DD STYLE="margin-bottom: 0.5cm"> |
| A BLOB contains binary data that has no structured content that |
| could reasonably be encoded as XML data. Usually, BLOBs will be data |
| in established binary formats such as GIF, JPEG, or OLE data |
| streams. |
| </DD><DT STYLE="margin-bottom: 0.5cm"> |
| subdocument |
| </DT><DD STYLE="margin-bottom: 0.5cm"> |
| A subdocument refers to an individual component of the document. |
| Each binary large object (BLOB) is one subdocument, as is the main |
| document content encoded as XML. |
| </DD><DT STYLE="margin-bottom: 0.5cm"> |
| main document content |
| </DT><DD STYLE="margin-bottom: 0.5cm"> |
| The main document content refers to the subdocument that best |
| represents the document as a whole. For example, if an OpenOffice |
| Writer document consists of an XML subdocument containing the text |
| body, and several image subdocuments, then the XML subdocument would |
| form the main document content. |
| </DD></DL> |
| <H2> |
| The Problem</H2> |
| <P>OpenOffice and the next generation of StarOffice use an XML-based |
| file format by default (although the user may choose to save in a different |
| format). This project therefore differs from many others: its |
| XML-based file format is intended to be truly a format for the |
| masses. The requirements for a native file format are different from |
| (and usually more demanding than) those for an interchange format. |
| </P> |
| <P>XML is meant for structured content and has no native support for |
| binary objects (BLOBs) such as images, OLE objects or other media |
| types. Since embedding of binary objects is needed for office |
| documents, a way to handle XML and binary data in the same document |
| must be provided. |
| </P> |
| <P>One solution may be to simply store the XML data in a file and |
| link to image data in separate files, much like in HTML documents. |
| While linking files at the user's request should be possible, this is |
| in general not considered adequate for office documents. A document |
| should be stored in a single file to make external handling of the |
| document (e.g. copying the document) possible. Therefore, binary |
| content must be stored in the same file as the XML content, requiring |
| either a <EM>package</EM> format or a means to encode binary content |
| in XML. For the sake of simplicity, we will refer to all solutions |
| that store XML and BLOB data in a single file as packages. |
| </P> |
| <H2>The Requirements</H2> |
| <P>The following requirements have been identified for the package |
| format: |
| </P> |
| <H3>A. Efficient Operation</H3> |
| <P>Users have traditionally required efficient operation, especially |
| for basic functionality such as loading or saving of documents. It is |
| our experience that independent loading (or saving) of subdocuments |
| is the key to efficient handling of large documents. Additionally, |
| users require that disk space is used efficiently. This leads us to |
| require the document format to support: |
| </P> |
| <OL> |
| <LI><P STYLE="margin-bottom: 0cm">small file sizes |
| </P> |
| <LI><P STYLE="margin-bottom: 0cm">on-demand loading (i.e., |
| independent loading of subdocuments) |
| </P> |
| <LI><P>independent saving of subdocuments |
| </P> |
| </OL> |
| <P>These requirements are strongly supported by ten years of |
| experience at StarOffice. Failing to meet these requirements has |
| repeatedly resulted in dissatisfaction and significant protests from |
| the user base. |
| </P> |
| <P><B>Note</B> on independent saving of subdocuments: Previous |
| versions of StarOffice have gotten significant benefit from copying |
| subdocuments on the file system level rather than reading and then |
| writing subdocuments through the <A HREF="http://ucb.openoffice.org/">UCB</A>. |
| This should be supported by the package format. Saving modified |
| subdocuments into an existing (compound) document may provide further |
| benefits. |
| </P> |
| <H3>B. Compatibility with Existing Tools</H3> |
| <P>A primary motivation for creating an XML based file format is the |
| ability to process, create and manipulate StarOffice documents with |
| external tools. To deserve the label "open" document |
| format, these tools should be standard and widely available. As far |
| as possible, this should apply to both the document and the |
| individual subdocuments (if applicable). In particular, the main |
| document content (using XML) should be accessible to standard XML |
| tools. |
| </P> |
| <P>This leads us to state three subrequirements: |
| </P> |
| <OL> |
| <LI><P STYLE="margin-bottom: 0cm">Subdocuments should be usable with |
| standard tools. In particular, XML subdocuments should be usable |
| with standard XML tools, e.g. XSLT transformations. |
| </P> |
| <LI><P STYLE="margin-bottom: 0cm">The document should be accessible |
| with standard tools, making it possible to insert, modify and |
| extract subdocuments, e.g. similar to zip or other popular |
| archivers. |
| </P> |
| <LI><P>The document should be accessible with ASCII-based tools, |
| e.g. the Windows file find function. |
| </P> |
| </OL> |
| <P><B>Note:</B> During discussion on openoffice.org mailing lists |
| several people preferred requirement number 1 to be extended to the |
| complete document. This would essentially require the document format |
| to be XML as well, with subdocuments (including BLOBs) being embedded |
| within the XML structure. |
| </P> |
| <P><B>Note:</B> Since XML is based on Unicode, ASCII based tools |
| cannot generally be assumed to work on XML files. Thus, no solution |
| will be able to fully support the third requirement. If UTF-8 |
| encoding is used, ASCII-based tools will work at least for those |
| languages that can be properly represented in ASCII, like English or |
| Latin. |
| </P> |
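<P>The note above can be illustrated with a short sketch. It assumes that an
"ASCII-based tool" simply searches a file for the raw ASCII bytes of a word; the
helper <CODE>ascii_search</CODE> and the sample strings are hypothetical and only
serve the illustration.</P>

```python
def ascii_search(needle: str, data: bytes) -> bool:
    """Mimic an ASCII-only search tool: look for the raw ASCII bytes of a word."""
    return needle.encode("ascii") in data


english = "<p>binary content</p>"
german = "<p>Größe</p>"  # contains two non-ASCII characters

utf8_english = english.encode("utf-8")
utf8_german = german.encode("utf-8")
utf16_english = english.encode("utf-16")

# Pure ASCII text is byte-for-byte identical in UTF-8 and in ASCII,
# so an ASCII tool can find words in it.
same_bytes = utf8_english == english.encode("ascii")
found_in_utf8 = ascii_search("binary", utf8_english)

# In UTF-16 every character occupies two bytes, so the same search fails.
found_in_utf16 = ascii_search("binary", utf16_english)

# Non-ASCII characters become multi-byte sequences in UTF-8, which an
# ASCII-only tool can neither enter as a search term nor display properly.
extra_bytes = len(utf8_german) - len(german)

print(same_bytes, found_in_utf8, found_in_utf16, extra_bytes)
```

<P>The ASCII parts of a UTF-8 file remain searchable, which is why UTF-8 is the
most tool-friendly choice among the Unicode encodings.</P>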
| <P><B>Note:</B> Compression interferes with some of the requirements |
| above. The discussion on the mailing list has clearly proven that |
| those requirements are very important to some users. To support these |
| users, |
| </P> |
| <OL> |
| <LI><P STYLE="margin-bottom: 0cm">compression should be made |
| optional, and |
| </P> |
| <LI><P>an additional implementation that writes pure XML (and saves |
| binary data into separate files) should be created. |
| </P> |
| </OL> |
| <P>Formats supporting compression on a per subdocument basis would |
| have an additional advantage in this respect. Also, this would allow |
| a subdocument with e.g. meta information (title, keywords, etc.) to |
| be stored uncompressed, making it more easily accessible. |
| </P> |
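<P>As a minimal sketch of per-subdocument compression, the following uses
Python's <CODE>zipfile</CODE> module to store a meta information subdocument
uncompressed while compressing the main content. The subdocument names
<CODE>meta.xml</CODE> and <CODE>content.xml</CODE> and the sample data are
assumptions made for this example, not part of any decided format.</P>

```python
import io
import zipfile


def build_package(meta_xml: bytes, content_xml: bytes) -> bytes:
    """Write a two-entry package, choosing the compression per subdocument."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as package:
        # Stored verbatim: the bytes appear unchanged inside the package file.
        package.writestr("meta.xml", meta_xml, compress_type=zipfile.ZIP_STORED)
        # Deflated: smaller, but no longer readable without unzipping.
        package.writestr("content.xml", content_xml,
                         compress_type=zipfile.ZIP_DEFLATED)
    return buffer.getvalue()


meta = b"<meta><title>Quarterly Report</title></meta>"
content = b"<body>" + b"<p>Lorem ipsum dolor sit amet.</p>" * 200 + b"</body>"
package = build_package(meta, content)

# A plain byte search (what an ASCII-based tool does) finds the title...
title_visible = b"Quarterly Report" in package
# ...but not the text of the compressed main content.
body_visible = b"Lorem ipsum" in package

print(title_visible, body_visible, len(content), len(package))
```

<P>The uncompressed entry costs a little space, but keeps the title and keywords
reachable for tools that know nothing about the package format.</P>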
| <H3>C. Security</H3> |
| <P>An additional advantage that may be gained through the use of a |
| package is easy support for document security. We can distinguish |
| between two security considerations: |
| </P> |
| <OL> |
| <LI><P STYLE="margin-bottom: 0cm">privacy: be able to encrypt |
| (partial) documents |
| </P> |
| <LI><P>integrity: be able to verify origin of (partial) documents |
| </P> |
| </OL> |
| <P><B>Note:</B> These requirements are of fairly low priority. |
| </P> |
| <H3>D. Additional Issues</H3> |
| <OL> |
| <LI><P>A package structure supporting documents and subdocuments |
| makes it easy to add additional information to a document, even if |
| the reference implementation does not understand it. For example, in |
| situations where a certain transformation is used very often, the |
| transformation result may be stored in the package as well. |
| </P> |
| <LI><P>The current implementation is based on the SAX API. XML |
| filters (transformations) also using the SAX API can be pipelined, |
| e.g. they can import (or export) data into (or from) OpenOffice by |
| going directly through the API, rather than going through a file. |
| </P> |
| </OL> |
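<P>The pipelining of SAX filters mentioned above can be sketched with Python's
<CODE>xml.sax</CODE> module. This is only an illustration of the idea, not the
OpenOffice implementation; the two filter classes and the element name
<CODE>image</CODE> are hypothetical.</P>

```python
import io
import xml.sax
from xml.sax.saxutils import XMLFilterBase, XMLGenerator


class DropElement(XMLFilterBase):
    """Filter that removes every element with a given name, children included."""

    def __init__(self, parent, name):
        super().__init__(parent)
        self._name = name
        self._skip_depth = 0  # greater than 0 while inside a dropped element

    def startElement(self, name, attrs):
        if self._skip_depth or name == self._name:
            self._skip_depth += 1
        else:
            super().startElement(name, attrs)

    def endElement(self, name):
        if self._skip_depth:
            self._skip_depth -= 1
        else:
            super().endElement(name)

    def characters(self, content):
        if not self._skip_depth:
            super().characters(content)


class UppercaseText(XMLFilterBase):
    """Filter that upper-cases all character data passing through it."""

    def characters(self, content):
        super().characters(content.upper())


def run_pipeline(xml_bytes: bytes) -> str:
    """Chain parser -> DropElement -> UppercaseText -> serializer, no temp file."""
    parser = xml.sax.make_parser()
    pipeline = UppercaseText(DropElement(parser, "image"))
    output = io.StringIO()
    pipeline.setContentHandler(XMLGenerator(output, encoding="utf-8"))
    pipeline.parse(io.BytesIO(xml_bytes))
    return output.getvalue()


result = run_pipeline(b"<doc><p>hello</p><image>AAAA</image></doc>")
print(result)
```

<P>Each filter receives SAX events from its parent and forwards (possibly
modified) events to the next stage, so no intermediate file is needed between
the stages.</P>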
| <H2>The Contenders</H2> |
| <P>On the openoffice.org mailing lists, the following possible |
| solutions were suggested: |
| </P> |
| <OL> |
| <LI><P STYLE="margin-bottom: 0cm">ZIP or JAR files |
| </P> |
| <LI><P STYLE="margin-bottom: 0cm">XML with binary data being |
| ASCII-encoded within special tags (e.g. base64) |
| </P> |
| <LI><P STYLE="margin-bottom: 0cm">MIME files |
| </P> |
| <LI><P STYLE="margin-bottom: 0cm">.tgz files |
| </P> |
| <LI><P>BONOBO libefs |
| </P> |
| </OL> |
| <P><B>Note:</B> The participants quickly focused on two choices: JAR |
| and XML + base64. |
| </P> |
| <P><B>Note:</B> ZIP and JAR files are identical for most intents and |
| purposes. JAR differs from the older ZIP in the file extension, as well |
| as in additional meta information stored in a directory called |
| META-INF. JAR files can be accessed using unmodified ZIP tools. |
| </P> |
| <H2>Examination of the contenders</H2> |
| <P>This chapter gives an overview of how well the various contenders |
| meet the specified requirements. It was my (<I>dvo</I>) impression |
| that most people agreed on how well each contender satisfies each |
| requirement. The disagreement was mainly |
| about the relative importance of the requirements. |
| </P> |
| <P>In particular, one group valued accessibility with XML tools (B1) |
| very highly and consequently argued for XML with base64 encoding, |
| while most others preferred ZIP/JAR. |
| </P> |
| <H3>ZIP / JAR</H3> |
| <P>Tools to create and manipulate ZIP files are widely available on |
| all platforms. The manifest file used with JAR files is usually |
| considered optional and may even be created using a text editor. |
| Access to subdocuments requires unzipping the required subdocuments |
| first. ASCII based tools will in general not work on the package, |
| although (at the user's request) individual subdocuments could be |
| stored uncompressed, thus making them available to ASCII based tools.</P> |
| <P>Since ZIP files have an index, efficient on-demand loading of |
| subdocuments can be achieved. Subdocuments can be copied from one |
| package to another without uncompressing and compressing them first. |
| With some hackery for on-demand saving, only those files following |
| the newly-written subdocuments need to be written. Due to (optional) |
| compression, files can be small, although not quite as small as .tgz |
| files. |
| </P> |
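<P>On-demand loading through the ZIP index can be sketched as follows, again
with Python's <CODE>zipfile</CODE> module. The entry names and the dummy image
bytes are assumptions for the example.</P>

```python
import io
import zipfile


def write_package(subdocuments: dict) -> bytes:
    """Pack each subdocument as its own individually compressed ZIP entry."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", compression=zipfile.ZIP_DEFLATED) as package:
        for name, data in subdocuments.items():
            package.writestr(name, data)
    return buffer.getvalue()


def load_subdocument(package_bytes: bytes, name: str) -> bytes:
    """Load one subdocument; the index tells the reader where its entry starts."""
    with zipfile.ZipFile(io.BytesIO(package_bytes)) as package:
        info = package.getinfo(name)  # looked up in the ZIP central directory
        return package.read(info)     # only this entry is decompressed


def entry_offsets(package_bytes: bytes) -> dict:
    """Return the start offset of every entry, as recorded in the index."""
    with zipfile.ZipFile(io.BytesIO(package_bytes)) as package:
        return {info.filename: info.header_offset for info in package.infolist()}


main_content = b"<doc><p>main document content</p></doc>"
package = write_package({
    "content.xml": main_content,
    "image1.png": bytes(range(256)) * 8,  # dummy bytes standing in for an image
})

offsets = entry_offsets(package)
loaded = load_subdocument(package, "content.xml")
print(offsets, loaded == main_content)
```

<P>Because every entry is compressed on its own and its offset is recorded in
the index, a reader can seek directly to one subdocument and ignore the rest of
the package.</P>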
| <P><B>Note:</B> As requested on the mailing list, the JAR manifest |
| file should be replaced by an XML-based manifest file. The older JAR |
| manifest may be supported for compatibility reasons. This is to be |
| determined. |
| </P> |
| <CENTER> |
| <TABLE BORDER=1 CELLPADDING=2 CELLSPACING=2> |
| <TR> |
| <TD></TD> |
| <TH COLSPAN=3> |
| <P>Efficiency</P> |
| </TH> |
| <TH COLSPAN=3> |
| <P>Standard Tools</P> |
| </TH> |
| </TR> |
| <TR> |
| <TD></TD> |
| <TH> |
| <P>size</P> |
| </TH> |
| <TH> |
| <P>load</P> |
| </TH> |
| <TH> |
| <P>save</P> |
| </TH> |
| <TH> |
| <P>XML</P> |
| </TH> |
| <TH> |
| <P>document</P> |
| </TH> |
| <TH> |
| <P>ASCII</P> |
| </TH> |
| </TR> |
| <TR> |
| <TH> |
| <P>ZIP / JAR</P> |
| </TH> |
| <TD> |
| <P ALIGN=CENTER>+</P> |
| </TD> |
| <TD> |
| <P ALIGN=CENTER>+</P> |
| </TD> |
| <TD> |
| <P ALIGN=CENTER>0</P> |
| </TD> |
| <TD> |
| <P ALIGN=CENTER>0</P> |
| </TD> |
| <TD> |
| <P ALIGN=CENTER>++</P> |
| </TD> |
| <TD> |
| <P ALIGN=CENTER>0</P> |
| </TD> |
| </TR> |
| </TABLE> |
| </CENTER> |
| <H3>XML + base64</H3> |
| <P>This suggestion is to use plain XML documents and embed binary |
| content as base64 encoded ASCII in special elements. |
| </P> |
| <P>Tools to generate or manipulate subdocuments in this format are |
| not commonly available, but could be written with reasonable effort. |
| Usability of the main document content with XML tools is, of course, |
| excellent. ASCII based tools should work with no problems (except for |
| the already mentioned UTF encodings). |
| </P> |
| <P>Without an index, on-demand loading is not possible, and neither is |
| on-demand saving. Files are not compressed, and binary subdocuments |
| are even expanded by about 33% due to the base64 encoding. |
| </P> |
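<P>The size penalty follows from base64 mapping every 3 input bytes to 4 output
characters. A small sketch using Python's <CODE>base64</CODE> module shows the
effect; the element name <CODE>binary-data</CODE> is hypothetical and the blob is
dummy data.</P>

```python
import base64


def embed_blob(blob: bytes) -> str:
    """Wrap a BLOB as base64 text inside a (hypothetical) XML element."""
    encoded = base64.b64encode(blob).decode("ascii")
    return f"<binary-data>{encoded}</binary-data>"


def extract_blob(element: str) -> bytes:
    """Recover the BLOB from the element produced by embed_blob."""
    start = element.index(">") + 1
    end = element.rindex("<")
    return base64.b64decode(element[start:end])


blob = bytes(range(256)) * 12  # 3072 dummy bytes standing in for an image
element = embed_blob(blob)

encoded_length = len(element) - len("<binary-data></binary-data>")
expansion = encoded_length / len(blob)  # 4 characters per 3 bytes

print(len(blob), encoded_length, round(expansion, 3))
```

<P>The encoded data is plain ASCII and therefore safe inside an XML element, at
the price of one extra byte for every three bytes of binary content.</P>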
| <P><B>Note:</B> In the discussion lists, several remedies to allow |
| on-demand reading were suggested: either including indices in |
| the XML file, or placing the binary data at the end of the file. The |
| former was considered to be non-XML (as it relies on the physical |
| layout of the XML data) and useless, since a standard XML tool |
| would not update the index and thus OpenOffice couldn't rely on it. |
| The latter cannot be done with standard (e.g. SAX-based) parsers such as |
| those used in the current implementation. The SAX API is necessary |
| for the filter pipelining mentioned in the requirements section, so |
| forgoing SAX may have significant drawbacks. |
| </P> |
| <CENTER> |
| <TABLE BORDER=1 CELLPADDING=2 CELLSPACING=2> |
| <TR> |
| <TD></TD> |
| <TH COLSPAN=3> |
| <P>Efficiency</P> |
| </TH> |
| <TH COLSPAN=3> |
| <P>Standard Tools</P> |
| </TH> |
| </TR> |
| <TR> |
| <TD></TD> |
| <TH> |
| <P>size</P> |
| </TH> |
| <TH> |
| <P>load</P> |
| </TH> |
| <TH> |
| <P>save</P> |
| </TH> |
| <TH> |
| <P>XML</P> |
| </TH> |
| <TH> |
| <P>document</P> |
| </TH> |
| <TH> |
| <P>ASCII</P> |
| </TH> |
| </TR> |
| <TR> |
| <TH> |
| <P>XML + base64</P> |
| </TH> |
| <TD> |
| <P ALIGN=CENTER>--</P> |
| </TD> |
| <TD> |
| <P ALIGN=CENTER>-</P> |
| </TD> |
| <TD> |
| <P ALIGN=CENTER>-</P> |
| </TD> |
| <TD> |
| <P ALIGN=CENTER>++</P> |
| </TD> |
| <TD> |
| <P ALIGN=CENTER>0</P> |
| </TD> |
| <TD> |
| <P ALIGN=CENTER>+</P> |
| </TD> |
| </TR> |
| </TABLE> |
| </CENTER> |
| <H3>MIME</H3> |
| <P>MIME is the established packaging format for emails. As is |
| required for SMTP compatibility, it is ASCII-based (7bit ASCII). |
| Non-Mailer tools to manipulate MIME files are rare. |
| </P> |
| <P>Being an ASCII based format, accessibility to ASCII based tools is |
| excellent. However, the encoding of non-ASCII characters is solved |
| differently than in XML. Tools to manipulate MIME files (outside of |
| mail programs) are not readily available, but could be written with |
| reasonable effort. Access with XML tools requires unpacking first, |
| but as said before the tools to do so are not currently available. |
| </P> |
| <P>MIME has no index of subdocuments, making on-demand loading |
| difficult. For saving of individual subdocuments, a reasoning similar |
| to ZIP files applies. However, copying subdocuments requires |
| accessing them, which is slow on MIME due to the lack of an index. |
| Since no compression is used, files get fairly large. In addition, |
| binary documents are usually encoded in base64, enlarging them by |
| 33%. |
| </P> |
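<P>To make the MIME discussion concrete, here is a minimal sketch using Python's
<CODE>email</CODE> package. The file name <CODE>image1.png</CODE>, the sample XML
and the dummy image bytes are assumptions for the example.</P>

```python
from email import message_from_bytes, policy
from email.message import EmailMessage


def build_mime_package(content_xml: str, image: bytes) -> bytes:
    """Pack the XML content and one binary subdocument into a MIME multipart."""
    message = EmailMessage()
    message.set_content(content_xml, subtype="xml")
    # Binary parts are transfer-encoded as base64 by default.
    message.add_attachment(image, maintype="image", subtype="png",
                           filename="image1.png")
    return message.as_bytes()


def find_subdocument(package: bytes, filename: str) -> bytes:
    """Locate one subdocument; without an index, every part must be walked."""
    message = message_from_bytes(package, policy=policy.default)
    for part in message.walk():
        if part.get_filename() == filename:
            return part.get_payload(decode=True)
    raise KeyError(filename)


image = bytes(range(256)) * 4  # 1024 dummy bytes, not a real PNG
package = build_mime_package("<doc><p>main document content</p></doc>", image)

recovered = find_subdocument(package, "image1.png")
print(len(image), len(package), recovered == image)
```

<P>The reader has to parse the whole file before it can hand out a single part,
and the package is larger than the binary data it carries.</P>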
| <CENTER> |
| <TABLE BORDER=1 CELLPADDING=2 CELLSPACING=2> |
| <TR> |
| <TD></TD> |
| <TH COLSPAN=3> |
| <P>Efficiency</P> |
| </TH> |
| <TH COLSPAN=3> |
| <P>Standard Tools</P> |
| </TH> |
| </TR> |
| <TR> |
| <TD></TD> |
| <TH> |
| <P>size</P> |
| </TH> |
| <TH> |
| <P>load</P> |
| </TH> |
| <TH> |
| <P>save</P> |
| </TH> |
| <TH> |
| <P>XML</P> |
| </TH> |
| <TH> |
| <P>document</P> |
| </TH> |
| <TH> |
| <P>ASCII |
| </P> |
| </TH> |
| </TR> |
| <TR> |
| <TH> |
| <P>MIME</P> |
| </TH> |
| <TD> |
| <P ALIGN=CENTER>--</P> |
| </TD> |
| <TD> |
| <P ALIGN=CENTER>-</P> |
| </TD> |
| <TD> |
| <P ALIGN=CENTER>-</P> |
| </TD> |
| <TD> |
| <P ALIGN=CENTER>-</P> |
| </TD> |
| <TD> |
| <P ALIGN=CENTER>0</P> |
| </TD> |
| <TD> |
| <P ALIGN=CENTER>+</P> |
| </TD> |
| </TR> |
| </TABLE> |
| </CENTER> |
| <H3>.tgz</H3> |
| <P>The .tgz format (tar files compressed with gzip) is the most |
| popular archiving format on the UNIX platform. Tools to manipulate |
| .tgz files exist on most other platforms as well. As participants on |
| the mailing lists mentioned, it is used as package format for |
| KOffice. |
| </P> |
| <P>.tgz operates according to the compress-after-packaging principle. |
| In general, this results in very good compression ratios. However, |
| the same principle causes on-demand loading of subdocuments to be |
| severely inefficient: Not only is it impossible to quickly determine |
| the start position of subdocuments, it is also impossible to start |
| reading subdocuments once the start position is known because the |
| state of the decompression tables cannot easily be reconstructed. For |
| the same reason, changing a subdocument or even copying subdocuments |
| is inefficient. |
| </P> |
| <P>Availability of tools for manipulating .tgz files is good, and |
| even very good on UNIX platforms. Accessibility to subdocuments |
| requires unpacking the subdocuments first, which is slow because the |
| document must be uncompressed from the start. |
| </P> |
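<P>The trade-off of compress-after-packaging can be sketched with Python's
<CODE>tarfile</CODE> and <CODE>zipfile</CODE> modules. The subdocument names and
contents are made up for the example: many small, similar XML parts, the case
that favours .tgz most. Actual size differences depend on the data.</P>

```python
import io
import tarfile
import zipfile


def make_subdocuments(count: int) -> dict:
    """Create small, similar XML subdocuments (hypothetical sample data)."""
    body = "<p>shared boilerplate text</p>" * 10
    return {f"part{i:03d}.xml": f"<part id='{i}'>{body}</part>".encode("utf-8")
            for i in range(count)}


def build_zip(subdocuments: dict) -> bytes:
    """Compress each subdocument separately, then package (ZIP)."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", compression=zipfile.ZIP_DEFLATED) as package:
        for name, data in subdocuments.items():
            package.writestr(name, data)
    return buffer.getvalue()


def build_tgz(subdocuments: dict) -> bytes:
    """Package first, then compress the whole archive as one stream (.tgz)."""
    buffer = io.BytesIO()
    with tarfile.open(fileobj=buffer, mode="w:gz") as package:
        for name, data in subdocuments.items():
            info = tarfile.TarInfo(name)
            info.size = len(data)
            package.addfile(info, io.BytesIO(data))
    return buffer.getvalue()


def read_tgz_member(package_bytes: bytes, name: str) -> bytes:
    """Read one member; gzip must decompress everything stored before it."""
    with tarfile.open(fileobj=io.BytesIO(package_bytes), mode="r:gz") as package:
        return package.extractfile(name).read()


subdocuments = make_subdocuments(50)
zip_bytes = build_zip(subdocuments)
tgz_bytes = build_tgz(subdocuments)
last_part = read_tgz_member(tgz_bytes, "part049.xml")

print(len(zip_bytes), len(tgz_bytes))
```

<P>Compressing the archive as a single stream lets redundancy between
subdocuments be exploited, which explains the smaller file, but reading the last
member means decompressing all data in front of it.</P>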
| <CENTER> |
| <TABLE BORDER=1 CELLPADDING=2 CELLSPACING=2> |
| <TR> |
| <TD></TD> |
| <TH COLSPAN=3> |
| <P>Efficiency</P> |
| </TH> |
| <TH COLSPAN=3> |
| <P>Standard Tools</P> |
| </TH> |
| </TR> |
| <TR> |
| <TD></TD> |
| <TH> |
| <P>size</P> |
| </TH> |
| <TH> |
| <P>load</P> |
| </TH> |
| <TH> |
| <P>save</P> |
| </TH> |
| <TH> |
| <P>XML</P> |
| </TH> |
| <TH> |
| <P>document</P> |
| </TH> |
| <TH> |
| <P>ASCII</P> |
| </TH> |
| </TR> |
| <TR> |
| <TH> |
| <P>.tgz</P> |
| </TH> |
| <TD> |
| <P ALIGN=CENTER>++</P> |
| </TD> |
| <TD> |
| <P ALIGN=CENTER>--</P> |
| </TD> |
| <TD> |
| <P ALIGN=CENTER>--</P> |
| </TD> |
| <TD> |
| <P ALIGN=CENTER>0</P> |
| </TD> |
| <TD> |
| <P ALIGN=CENTER>+</P> |
| </TD> |
| <TD> |
| <P ALIGN=CENTER>0</P> |
| </TD> |
| </TR> |
| </TABLE> |
| </CENTER> |
| <H3>Comparison Table</H3> |
| <CENTER> |
| <TABLE BORDER=1 CELLPADDING=2 CELLSPACING=2> |
| <TR> |
| <TD></TD> |
| <TH COLSPAN=3> |
| <P>Efficiency</P> |
| </TH> |
| <TH COLSPAN=3> |
| <P>Standard Tools</P> |
| </TH> |
| </TR> |
| <TR> |
| <TD></TD> |
| <TH> |
| <P>size</P> |
| </TH> |
| <TH> |
| <P>load</P> |
| </TH> |
| <TH> |
| <P>save</P> |
| </TH> |
| <TH> |
| <P>XML</P> |
| </TH> |
| <TH> |
| <P>document</P> |
| </TH> |
| <TH> |
| <P>ASCII [6]</P> |
| </TH> |
| </TR> |
| <TR> |
| <TH> |
| <P>ZIP / JAR</P> |
| </TH> |
| <TD> |
| <P ALIGN=CENTER>+</P> |
| </TD> |
| <TD> |
| <P ALIGN=CENTER>+</P> |
| </TD> |
| <TD> |
| <P ALIGN=CENTER>0 [1]</P> |
| </TD> |
| <TD> |
| <P ALIGN=CENTER>0 [2]</P> |
| </TD> |
| <TD> |
| <P ALIGN=CENTER>++</P> |
| </TD> |
| <TD> |
| <P ALIGN=CENTER>0 [2]</P> |
| </TD> |
| </TR> |
| <TR> |
| <TH> |
| <P>XML + base64</P> |
| </TH> |
| <TD> |
| <P ALIGN=CENTER>--</P> |
| </TD> |
| <TD> |
| <P ALIGN=CENTER>- [3]</P> |
| </TD> |
| <TD> |
| <P ALIGN=CENTER>-</P> |
| </TD> |
| <TD> |
| <P ALIGN=CENTER>++</P> |
| </TD> |
| <TD> |
| <P ALIGN=CENTER>0 [4]</P> |
| </TD> |
| <TD> |
| <P ALIGN=CENTER>+</P> |
| </TD> |
| </TR> |
| <TR> |
| <TH> |
| <P>MIME</P> |
| </TH> |
| <TD> |
| <P ALIGN=CENTER>--</P> |
| </TD> |
| <TD> |
| <P ALIGN=CENTER>-</P> |
| </TD> |
| <TD> |
| <P ALIGN=CENTER>- [1]</P> |
| </TD> |
| <TD> |
| <P ALIGN=CENTER>- [2]</P> |
| </TD> |
| <TD> |
| <P ALIGN=CENTER>0 [4]</P> |
| </TD> |
| <TD> |
| <P ALIGN=CENTER>+</P> |
| </TD> |
| </TR> |
| <TR> |
| <TH> |
| <P>.tgz</P> |
| </TH> |
| <TD> |
| <P ALIGN=CENTER>++</P> |
| </TD> |
| <TD> |
| <P ALIGN=CENTER>--</P> |
| </TD> |
| <TD> |
| <P ALIGN=CENTER>--</P> |
| </TD> |
| <TD> |
| <P ALIGN=CENTER>0 [2]</P> |
| </TD> |
| <TD> |
| <P ALIGN=CENTER>+ [5]</P> |
| </TD> |
| <TD> |
| <P ALIGN=CENTER>0 [2]</P> |
| </TD> |
| </TR> |
| </TABLE> |
| </CENTER> |
| <P ALIGN=CENTER>++ = very good, + = good, |
| 0 = medium, - = bad, -- = very bad</P> |
| <P>Remarks:</P> |
| <OL> |
| <LI><P STYLE="margin-bottom: 0cm">Copying subdocuments is possible. |
| With some hackery, (repeated) partial updates can be supported. |
| </P> |
| <LI><P STYLE="margin-bottom: 0cm">Requires unpackaging (or |
| package-aware tools), so score depends on availability of tools and |
| their efficiency |
| </P> |
| <LI><P STYLE="margin-bottom: 0cm">On-demand loading may be |
| implemented at the expense of being unable to use standard XML |
| parser APIs (e.g. SAX). |
| </P> |
| <LI><P STYLE="margin-bottom: 0cm">Tools can be written with |
| reasonable effort but don't currently exist; at least not on many |
| platforms. |
| </P> |
| <LI><P STYLE="margin-bottom: 0cm">Very good on Unix, good on other |
| platforms. |
| </P> |
| <LI><P>Due to XML being Unicode-based, ASCII-based tools probably |
| don't work in many circumstances, independent of the document |
| format. |
| </P> |
| </OL> |
| <H3>libefs |
| </H3> |
| <P>Additionally, the BONOBO libefs was suggested. Since this was not |
| much discussed in the mailing lists, I (<I>dvo</I>) tried to find out |
| about libefs from the web. This was my impression: |
| </P> |
| <UL> |
| <LI><P STYLE="margin-bottom: 0cm">libefs was created to solve a |
| problem very similar to ours. It is designed to be an ambitious |
| file-system-in-a-file solution. If it works out the way its |
| developers imagine, it will be technically superior to the other |
| proposals discussed here. |
| </P> |
| <LI><P STYLE="margin-bottom: 0cm">The page didn't mention any ports, |
| thus making it a UNIX solution only. This prohibits its use in a |
| cross-platform application like OpenOffice. |
| </P> |
| <LI><P STYLE="margin-bottom: 0cm">No external tools to manipulate |
| libefs files exist. |
| </P> |
| <LI><P STYLE="margin-bottom: 0cm">The page I consulted referred to |
| it as "very much a work-in-progress", meaning that |
| currently it is not suitable for a consumer product. |
| </P> |
| <LI><P>The libefs documentation was a bit unclear on the |
| relationship between libefs API and libefs implementation (meaning: |
| file format(s)). If it is mainly an API, it would compete with the |
| current <A HREF="http://ucb.openoffice.org/">UCB</A> architecture. |
| </P> |
| </UL> |
| <H3>Security</H3> |
| <P>Documents could of course be encrypted for all candidate formats. |
| All formats could encrypt subdocuments as well, except for |
| XML+base64, which would make it impossible to encrypt the main |
| document content without encrypting the binaries as well. In .tgz, |
| subdocument encryption could adversely affect compression, since the |
| entire file is compressed at once. |
| </P> |
| <P>The same applies to file integrity verification. JAR is very |
| appealing in this regard, as it already has the full infrastructure |
| for subdocument and document integrity verification in place. |
| </P> |
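<P>The idea of per-subdocument integrity checks can be sketched as follows. This
is not the JAR signing mechanism itself, only an illustration of keeping one
digest per subdocument; verifying the origin would additionally require signing
the digest list. All names and sample data are assumptions.</P>

```python
import hashlib
import io
import zipfile


def build_package(subdocuments: dict) -> bytes:
    """Write the given subdocuments into an in-memory ZIP package."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", compression=zipfile.ZIP_DEFLATED) as package:
        for name, data in subdocuments.items():
            package.writestr(name, data)
    return buffer.getvalue()


def digest_subdocuments(package_bytes: bytes) -> dict:
    """Compute a SHA-256 digest for every subdocument in the package."""
    with zipfile.ZipFile(io.BytesIO(package_bytes)) as package:
        return {name: hashlib.sha256(package.read(name)).hexdigest()
                for name in package.namelist()}


def changed_subdocuments(expected: dict, package_bytes: bytes) -> list:
    """List subdocuments whose digest differs from (or is missing in) expected."""
    actual = digest_subdocuments(package_bytes)
    return sorted(name for name in expected.keys() | actual.keys()
                  if expected.get(name) != actual.get(name))


original = build_package({
    "content.xml": b"<doc><p>main document content</p></doc>",
    "image1.png": b"\x00\x01\x02\x03",  # dummy bytes, not a real image
})
expected_digests = digest_subdocuments(original)

tampered = build_package({
    "content.xml": b"<doc><p>main document content</p></doc>",
    "image1.png": b"\x00\x01\x02\xff",
})

print(changed_subdocuments(expected_digests, original))
print(changed_subdocuments(expected_digests, tampered))
```

<P>Because the digests are kept per subdocument, a verifier can tell which part
of the document was modified rather than only that the file as a whole
changed.</P>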
| <H2>The Conclusion</H2> |
| <P>The JAR file format seems to provide the best balance among the |
| stated requirements. Unlike all other formats, it does not have any |
| real weakness in regard to any of the requirements, and quite a lot |
| of advantages. |
| </P> |
| <P>If one values accessibility with standard XML tools very highly, XML |
| + base64 seems a logical choice. While this is one of the |
| requirements for OpenOffice, the xmloff development team does not |
| share the opinion that this single requirement significantly outweighs |
| all other requirements in importance. To accommodate this user base, |
| an additional pure XML file format with external subdocuments should |
| be created. |
| </P> |
| <H3>Remaining Issues</H3> |
| <UL> |
| <LI><P STYLE="margin-bottom: 0cm">Decide on meta information and |
| manifest file format. |
| </P> |
| <LI><P>Implement a UCB that allows pure XML documents. |
| </P> |
| </UL> |
| <H3>Further Reading</H3> |
| <UL> |
| <LI><P STYLE="margin-bottom: 0cm">Most of these topics have been |
| discussed in detail on the openoffice.org <A HREF="http://www.openoffice.org/www-discuss/">discuss</A> |
| and <A HREF="http://xml.openoffice.org/xml-dev/">xml-dev</A> mailing |
| lists. The archives should provide more information. |
| </P> |
| <LI><P>Details of ZIP and many other file formats are described on |
| <A HREF="http://www.wotsit.org/">www.wotsit.org</A>. |
| </P> |
| </UL> |
| <HR> |
| <P><FONT SIZE=2>The <A HREF="http://xml.openoffice.org/">xmloff</A> |
| development team <BR>Summarized by <A HREF="mailto:dvo@openoffice.org">dvo</A> |
| </FONT> |
| </P> |
| </body> |
| </HTML> |