| ============================================== |
| Design document for ODT parsing and generation |
| ============================================== |
| |
| :Author: ts |
| |
| The scope of this document is to define the design for a first implementaion of |
| ODT (Open Document Text) support in the eZ Document component. The parts of the |
| Document component designed in this document do not affect other Open Document |
| formats like spreadsheets or graphics. The goal is to define the infrastructure |
| for reading and writing ODT documents, i.e. to convert existing ODT documents |
| into the internal representation of the Document component (DocBook XML) and to |
| generate new ODT documents from the internal representation. |
| |
| ------------ |
| Requirements |
| ------------ |
| |
| The following sections describe the requirements for the ODT handling in the |
| Document component. The first section defines requirements for reading ODT, the |
| second for writing ODT and the third section defines requirements for later |
| enhancements to be kept in mind during the initial implementation. |
| |
| Import |
| ====== |
| |
| The Document component should be able to parse existing ODT documents and to |
| convert them to the internal format used by the Document component (DocBook |
| XML). Requirements for the import process are: |
| |
| - Read plain XML ODT files |
| - Parse all necessary structural ODT elements |
| - Convert ODT elements properly into equivalent or similar DocBook representations |
| - Maintaining the content semantics provided by the ODT as good as possible |
| - Maintain meta information provided by the ODT as good as possible |
| - Develop a first heuristical approach of how ODT styling information can be |
| used to determine semantics of an element. |
| |
| Export |
| ====== |
| |
| The Document component should be able to generate new ODT documents from an |
| existing internal representations (DocBook XML). Requirements for this process |
| are: |
| |
| - Write plain XML ODT files |
| - Convert DocBook representation elements to their corresponding ODT |
| representations |
| - Maintain the document structure |
| - Maintain content and metadata semantics as good as possible |
| - Styling of ODT elements. |
| |
| Later enhancements |
| ================== |
| |
| In the first step of ODT integration only rudimentary features for import and |
| export should be realized. The following ideas must be kept in mind during the |
| design and implementation, to ensure future extensibility. |
| |
| - Reading / writing of ODT package files (ZIP) |
| - ODF can be presented either as a single XML file or as a ZIP package |
| containg multiple XML files and other related files (e.g. images) in |
| addition. |
| - Reading and writing this format is not necessary from the start, but since |
| it is the default way for users to store ODT, it should be supported later |
| on. |
| - The handling of ZIP files requires a tie-in with the Archive component or |
| similar. |
| |
| ------ |
| Design |
| ------ |
| |
| In the first development cycle, only the structural conversion between ODT and |
| DocBook XML will be considered. In addition, rudimentary styling information |
| will be taken into account. The reading and writing of ODF packages is not |
| considered in this design. |
| |
| Import |
| ====== |
| |
| Three different steps are necessary to import an ODT document and convert it |
| into DocbookXml: |
| |
| 1. Read the XML data |
| 2. Preprocess the ODT representation |
| 3. Actual conversion to DocBook XML representation |
| |
| Step 1 will be performed through the DOM extension in PHP, the internal |
| representation of an ODT will be a DOM treee. The second step performs |
| pre-processing on this DOM tree. Pre-processing is e.g. needed to assign |
| additional semantics to the ODT elements to achieve a better rendering. |
| Finally, the pre-processed DOM tree will be visited, to achieve the actual |
| creation of the DocBook XML representation. |
| |
| Pre-processing |
| -------------- |
| |
| The step of pre-processing the ODT representation is necessary to assign |
| DocBoox semantics to the ODT elements. ODT and DocBook XML have some |
| similarities, but also differ widely in some parts. The pre-processing step |
| performs manipultations on the ODT representation and potentially adds |
| information which is utilized by the latter conversion step to create a correct |
| semantical representation. |
| |
| This process works similar to filters in the XHTML document import. The class |
| level design of this feature is inspired by the XHTML handling: Filters can be |
| registered which pre-process the incoming ODT in the given order. |
| |
| A filter may process the following steps on a DOMElement: |
| |
| - Add type information to an XML element to determine into which DocBook XML |
| element the element will be converted |
| - Add attribute information to determine the attributes in the DocBook XML |
| representation |
| - Add additional elements or element hierarchies |
| |
| The resulting DOM tree must not necessarily be valid ODT anymore, to reflect |
| the latter DocBook structure in a better way. |
| |
| The first implemented filter will only perform rudimentary operations on the |
| DOM to assign basic semantical information to the elements. A second |
| implementation will be an additional filter which takes some styling |
| information into account to enhance this information. Futher filters can be |
| implemented by third parties to extend or replace these mechanisms. |
| |
| Conversion |
| ---------- |
| |
| The conversion process itself will mostly visit the DOM tree and utilize the |
| information, attached to the elements in the pre-processing step, to generate a |
| DocBook XML with the corresponding content. The filter pre-processing step is |
| responsible to annotate all significant elements properly so that the |
| conversion can use them. |
| |
| Flat ODT documents (consisting of only 1 XML file), which will purely be |
| handled in the first version of ODT support, may contain image content embeded. |
| To extract those, the user my specify a target directory or the system temp dir |
| will be used as the default. The content will then be referenced in DocBook |
| from this location. |
| |
| .. note:: We should check if it is possible to define and handle data URLs in |
| docbook. May be problematic with other formats though. (kn) |
| |
| Export |
| ====== |
| |
| .. note:: First sentence a bit unclear ;) (kn) |
| |
| The export process for ODT works similar to PDF rendering, except for that is a |
| little bit less strict. The internal DocBook representation is converted to the |
| desired ODT representation according to its semantics. |
| |
| Based on the DocBook XML elements, the user can define styles using a |
| simplified CSS syntax (see PDF). Each of the style definitions is converted to |
| an automatic style in the resulting ODT document. ODT elements affected by a |
| certain style get this style applied. |
| |
| Styles |
| ====== |
| |
| A style is defined for each styling information. There is no direct assignement |
| of layouting elements to styling information, but always a style in between. |
| The <style/> element has the following properties: |
| |
| name |
| The internal name of the style. Must be unique over all styles, in |
| concatenation with the style:family. |
| displayname |
| Name of the style to display in GUIs. If left out, the name is used. |
| family |
| Family collection of the style. One of (in context of text documents): |
| text |
| Style that might be applied to any piece of text. |
| paragraph |
| Style for complete paragraphs and headings. |
| section |
| Style to be applied to sections of text in text documents (@TODO: Not |
| handled yet!). |
| ruby |
| Not handled, yet. |
| table |
| table-column |
| table-row |
| table-cell |
| table-page |
| chart |
| default |
| graphic |
| parent-style-name |
| Identifies a parten style. Style properties of the parent are inherited and |
| maybe overwritten. If no parent style is specified, the default style for |
| the styles family will be the base for inheritence. |
| next-style-name |
| Next paragraph style. If a new paragraph is started after the element this |
| style is applied to, this paragraph will have the style named in this |
| element. Only sensible for editing in a GUI. |
| list-style-name |
| Style used in headings and paragraphs of lists contained in the styled |
| element, only if the lists have no list-style applied themselves. |
| master-page-name |
| Styles with a master page applied will force a page break before the |
| element and load the styles from the master-page then. |
| data-style-name |
| Styling of table cells (e.g. formulas, currencies, ...). |
| class |
| Information for GUIs, to sort styles into categories. |
| default-outline-level |
| "Transforms" a paragraph into some kind of heading, without making it a |
| heading itself. Senseless. |
| |
| Style mappings (replacing a style conditionally with another style) will not be |
| taken into account, yet. |
| |
| Types of styles |
| --------------- |
| |
| default-style |
| Default styles must be defined for each used style family. The default |
| style is always the base of inheritance for the style family. |
| page-layout |
| Definition of the global page properties, format and stuff. |
| header-style / footer-style |
| Styling of the header and footer area. |
| master-page |
| Definition of a master page. Defines header / footer, forms, styles for the |
| page and more. |
| |
| Table templates |
| --------------- |
| |
| Not yet handled. |
| |
| Font face declaration |
| --------------------- |
| |
| Correspond to the @font-face declaration of CSS2. |
| |
| Data styles |
| ----------- |
| |
| Not yet handled. |
| |
| List styles |
| ----------- |
| |
| Define properties of a list (not its content!). A style for each list level. If |
| no style exists for a specific level, the next lower level style is used. If |
| none is defined, a default style is used. name and display-name properties as |
| ususal. Can have the consecutive-numbering attribute defined, to specify if |
| different list levels restart numbering or not |
| |
| List styles |
| ----------- |
| |
| Define properties of a list (not its content!). A style for each list level. If |
| no style exists for a specific level, the next lower level style is used. If |
| none is defined, a default style is used. name and display-name properties as |
| ususal. Can have the consecutive-numbering attribute defined, to specify if |
| different list levels restart numbering or not. |
| |
| List-level styles |
| ^^^^^^^^^^^^^^^^^ |
| |
| A list-level style commonly has a level attribute, defining, to which |
| list-level the style is applied. All other attribute depend on the type of |
| list. A list may contain different kinds of lists, depending on the depth of |
| the level. |
| |
| Number level styles |
| ~~~~~~~~~~~~~~~~~~~ |
| |
| Defining an enumeration list level using a list-level-style-number element. Has |
| the following attributes: |
| |
| style-name |
| Defines the text style for list item numbers. |
| num-format |
| Defines the formatting of the list item numbers. |
| display-levels |
| Defines how many level numberings to display (e.g. 1.2.3 or just 1.2). |
| start-value |
| Defines the first number to be used by the very first element of the |
| defined level. |
| |
| Bullet level style |
| ~~~~~~~~~~~~~~~~~~ |
| |
| Attributes defining a list level to be an item list. |
| |
| text-style |
| Style for the bullet character. |
| bullet-character |
| A unicode character to be used as the bullet. |
| num-format-prefix / num-format-suffix |
| Prefix and suffix to be placed before / after a bullet. |
| bullet-relative-size |
| Relative size (percentage, integer) of the bullet in respect to the item |
| content. |
| |
| Image level style |
| ~~~~~~~~~~~~~~~~~ |
| |
| Creates items preceeded by images. The image to be used is either referenced or |
| stored using base64 encoded binary data. |
| |
| |
| |
| |
| |
| .. |
| Local Variables: |
| mode: rst |
| fill-column: 79 |
| End: |
| vim: et syn=rst tw=79 |