| <?xml version="1.0" standalone="no"?> |
| <!-- |
| Licensed to the Apache Software Foundation (ASF) under one or more |
| contributor license agreements. See the NOTICE file distributed with |
| this work for additional information regarding copyright ownership. |
| The ASF licenses this file to You under the Apache License, Version 2.0 |
| (the "License"); you may not use this file except in compliance with |
| the License. You may obtain a copy of the License at |
| |
| http://www.apache.org/licenses/LICENSE-2.0 |
| |
| Unless required by applicable law or agreed to in writing, software |
| distributed under the License is distributed on an "AS IS" BASIS, |
| WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. |
| See the License for the specific language governing permissions and |
| limitations under the License. |
| --> |
| <!-- $Id$ --> |
| <!DOCTYPE document PUBLIC "-//APACHE//DTD Documentation V2.0//EN" "http://forrest.apache.org/dtd/document-v20.dtd"> |
| <document> |
| <header> |
| <title>Apache FOP: Hyphenation</title> |
| <version>$Revision$</version> |
| </header> |
| <body> |
| <section id="support"> |
| <title>Hyphenation Support</title> |
| <section id="intro"> |
| <title>Introduction</title> |
| <p>FOP uses Liang's hyphenation algorithm, well known from TeX. It needs |
| language specific pattern and other data for operation.</p> |
| <p>Because of <a href="#license-issues">licensing issues</a> (and for |
| convenience), all hyphenation patterns for FOP are made available through |
| the <a class="fork" href="http://offo.sourceforge.net/hyphenation/index.html">Objects For |
| Formatting Objects</a> project.</p> |
| <note>If you have made improvements to an existing FOP hyphenation pattern, |
| or if you have created one from scratch, please consider contributing these |
| to OFFO so that they can benefit other FOP users as well. |
| Please inquire on the <a href="../maillist.html#fop-user">FOP User |
| mailing list</a>.</note> |
| </section> |
| <section id="license-issues"> |
| <title>License Issues</title> |
| <p>Many of the hyphenation files distributed with TeX and its offspring are |
| licenced under the <a class="fork" href="http://www.latex-project.org/lppl.html">LaTeX |
| Project Public License (LPPL)</a>, which prevents them from being |
| distributed with Apache software. The LPPL puts restrictions on file names |
| in redistributed derived works which we feel can't guarantee. Some |
| hyphenation pattern files have other or additional restrictions, for |
| example against use for commercial purposes.</p> |
| <p>Although Apache FOP cannot redistribute hyphenation pattern files that do |
| not conform with its license scheme, that does not necessarily prevent users |
| from using such hyphenation patterns with FOP. However, it does place on |
| the user the responsibility for determining whether the user can rightly use |
| such hyphenation patterns under the hyphenation pattern license.</p> |
| <warning>The user is responsible to settle license issues for hyphenation |
| pattern files that are obtained from non-Apache sources.</warning> |
| </section> |
| <section id="sources"> |
| <title>Sources of Custom Hyphenation Pattern Files</title> |
| <p>The most important source of hyphenation pattern files is the |
| <a class="fork" href="http://www.ctan.org/tex-archive/language/hyphenation/">CTAN TeX |
| Archive</a>.</p> |
| </section> |
| <section id="install"> |
| <title>Installing Custom Hyphenation Patterns</title> |
| <p>To install a custom hyphenation pattern for use with FOP:</p> |
| <ol> |
| <li>Convert the TeX hyphenation pattern file to the FOP format. The FOP |
| format is an xml file conforming to the DTD found at |
| <code>{fop-dir}/hyph/hyphenation.dtd</code>.</li> |
| <li>Name this new file following this schema: |
| <code>languageCode_countryCode.xml</code>. The country code is |
| optional, and should be used only if needed. For example: |
| <ul> |
| <li><code>en_US.xml</code> would be the file name for American |
| English hyphenation patterns.</li> |
| <li><code>it.xml</code> would be the file name for Italian |
| hyphenation patterns.</li> |
| </ul> |
| The language and country codes must match the XSL-FO input, which |
| follows <a href="http://www.ics.uci.edu/pub/ietf/http/related/iso639.txt">ISO |
| 639</a> (languages) and <a href="http://www.ics.uci.edu/pub/ietf/http/related/iso3166.txt">ISO |
| 3166</a> (countries). NOTE: The ISO 639/ISO 3166 convention is that |
| language names are written in lower case, while country codes are written |
| in upper case. FOP does not check whether the language and country specified |
| in the FO source are actually from the current standard, but it relies |
| on it being two letter strings in a few places. So you can make up your |
| own codes for custom hyphenation patterns, but they should be two |
| letter strings too (patches for proper handling extensions are welcome)</li> |
| <li>There are basically three ways to make the FOP-compatible hyphenation pattern |
| file(s) accessible to FOP: |
| <ul> |
| <li>Download the precompiled JAR from <a class="fork" href="http://offo.sourceforge.net/hyphenation/index.html">OFFO |
| </a> and place it either in the <code>{fop-dir}/lib</code> directory, or |
| in a directory of your choice (and append the full path to the JAR to |
| the environment variable <code>FOP_HYPHENATION_PATH</code>).</li> |
| <li>Download the desired FOP-compatible hyphenation pattern file(s) from |
| <a class="fork" href="http://offo.sourceforge.net/hyphenation/index.html">OFFO</a>, |
| and/or take your self created hyphenation pattern file(s), |
| <ul> |
| <li>place them in the directory <code>{fop-dir}/hyph</code>, </li> |
| <li>or place them in a directory of your choice and set the Ant variable |
| <code>user.hyph.dir</code> to point to that directory (in |
| <code>build-local.properties</code>),</li> |
| </ul> |
| and run Ant with build target |
| <code>jar-hyphenation</code>. This will create a JAR containing the |
| compiled patterns in <code>{fop-dir}/build</code> that will be added to the |
| classpath on the next run. |
| (When FOP is built from scratch, and there are pattern source file(s) |
| present in the directory pointed to by the |
| <code>user.hyph.dir</code> variable, this JAR will automatically |
| be created from the supplied pattern(s)).</li> |
| <li>Put the pattern source file(s) into a directory of your choice and |
| configure FOP to look for custom patterns in this directory, by setting the |
| <a href="configuration.html"><hyphenation-base></a> |
| configuration option.</li> |
| </ul> |
| </li> |
| </ol> |
| <warning> |
| Either of these three options will ensure hyphenation is working when using |
| FOP from the command-line. If FOP is being embedded, remember to add the location(s) |
| of the hyphenation JAR(s) to the CLASSPATH (option 1 and 2) or to set the |
| <a href="configuration.html#hyphenation-dir"><hyphenation-dir></a> |
| configuration option programmatically (option 3). |
| </warning> |
| </section> |
| </section> |
| <section id="patterns"> |
| <title>Hyphenation Patterns</title> |
| <p>If you would like to build your own hyphenation pattern files, or modify |
| existing ones, this section will help you understand how to do so. Even |
| when creating a pattern file from scratch, it may be beneficial to start |
| with an existing file and modify it. See <a class="fork" href="http://offo.sourceforge.net/hyphenation/index.html"> |
| OFFO's Hyphenation page</a> for examples. |
| Here is a brief explanation of the contents of FOP's hyphenation patterns:</p> |
| <warning>The remaining content of this section should be considered "draft" |
| quality. It was drafted from theoretical literature, and has not been |
| tested against actual FOP behavior. It may contain errors or omissions. |
| Do not rely on these instructions without testing everything stated here. |
| If you use these instructions, please provide feedback on the |
| <a href="../maillist.html#fop-user">FOP User mailing list</a>, either |
| confirming their accuracy, or raising specific problems that we can |
| address.</warning> |
| <ul> |
| <li>The root of the pattern file is the <hyphenation-info> element.</li> |
| <li><hyphen-char>: its attribute "value" contains the character signalling |
| a hyphen in the <exceptions> section. It has nothing to do with the |
| hyphenation character used in FOP, use the XSLFO hyphenation-character |
| property for defining the hyphenation character there. At some points |
| a dash U+002D is hardwired in the code, so you'd better use this too |
| (patches to rectify the situation are welcome). There is no default, |
| if you declare exceptions with hyphenations, you must declare the |
| hyphen-char too.</li> |
| <li><hyphen-min> contains two attributes: |
| <ul> |
| <li>before: the minimum number of characters in a word allowed to exist |
| on a line immediately preceding a hyphenated word-break.</li> |
| <li>after: the minimum number of characters in a word allowed to exist |
| on a line immediately after a hyphenated word-break.</li> |
| </ul> |
| This element is unused and not even read. It should be considered a |
| documentation for parameters used during pattern generation. |
| </li> |
| <li><classes> contains whitespace-separated character sets. The members |
| of each set should be treated as equivalent for purposes of hyphenation, |
| usually upper and lower case of the same character. The first character |
| of the set is the canonical character, the patterns and exceptions |
| should only contain these canonical representation characters (except |
| digits for weight, the period (.) as word delimiter in the patterns and |
| the hyphen char in exceptions, of course).</li> |
| <li><exceptions> contains whitespace-separated words, each of which |
| has either explicit hyphen characters to denote acceptable breakage |
| points, or no hyphen characters, to indicate that this word should |
| never be hyphenated, or contain explicit <hyp> elements for specifying |
| changes of spelling due to hyphenation (like backen -> bak-ken or |
| Stoffarbe -> Stoff-farbe in the old german spelling). Exceptions override |
| the patterns described below. Explicit <hyp> declarations don't work |
| yet (patches welcome). Exceptions are generally a bit brittle, test |
| carefully.</li> |
| <li><patterns> includes whitespace-separated patterns, which are what |
| drive most hyphenation decisions. The characters in these patterns are |
| explained as follows: |
| <ul> |
| <li>non-numeric characters represent characters in a sub-word to be |
| evaluated</li> |
| <li>the period character (.) represents a word boundary, i.e. either |
| the beginning or ending of a word</li> |
| <li>numeric characters represent a scoring system for indicating the |
| acceptability of a hyphen in this location. Odd numbers represent an |
| acceptable location for a hyphen, with higher values overriding lower |
| inhibiting values. Even numbers indicate an unacceptable location, with |
| higher values overriding lower values indicating an acceptable position. |
| A value of zero (inhibiting) is implied when there is no number present. |
| Generally patterns are constructed so that valuse greater than 4 are rare. |
| Due to a bug currently patterns with values of 8 and greater don't |
| have an effect, so don't wonder.</li> |
| </ul> |
| Here are some examples from the English patterns file: |
| <ul> |
| <li>Knuth (<em>The TeXBook</em>, Appendix H) uses the example <strong>hach4</strong>, which indicates that it is extremely undesirable to place a hyphen after the substring "hach", for example in the word "toothach-es".</li> |
| <li><strong>.leg5e</strong> indicates that "leg-e", when it occurs at the beginning of a word, is a very good place to place a hyphen, if one is needed. Words like "leg-end" and "leg-er-de-main" fit this pattern.</li> |
| </ul> |
| Note that the algorithm that uses this data searches for each of the word's substrings in the patterns, and chooses the <em>highest</em> value found for letter combination. |
| </li> |
| </ul> |
| <p>If you want to convert a TeX hyphenation pattern file, you have to undo |
| the TeX encoding for non-ASCII text. FOP uses Unicode, and the patterns |
| must be proper Unicode too. You should be aware of the XML encoding issues, |
| preferably use a good Unicode editor.</p> |
| <p>Note that FOP does not do Unicode character normalization. If you use |
| combining chars for accents and other character decorations, you must |
| declare character classes for them, and use the same sequence of base character |
| and combining marks in the XSLFO source, otherwise the pattern wouldn't match. |
| Fortunately, Unicode provides precomposed characters for all important cases |
| in common languages, until now nobody run seriously into this issue. Some dead |
| languages and dialects, especially ancient ones, may pose a real problem |
| though.</p> |
| <p>If you want to generate your own patterns, an open-source utility called |
| patgen is available on many Unix/Linux distributions and every TeX |
| distribution which can be used to assist in |
| creating pattern files from dictionaries. Pattern creation for languages like |
| english or german is an art. If you can, read Frank Liang's original paper |
| "Word Hy-phen-a-tion by Com-pu-ter" (yes, with hyphens). It is not available |
| online. The original patgen.web source, included in the TeX source distributions, |
| contains valuable comments, unfortunately technical details obscure often the |
| high level issues. Another important source is |
| <a class="fork" href="http://www.ctan.org/tex-archive/systems/knuth/tex/texbook.tex">The |
| TeX Book</a>, appendix H (either read the TeX source, or run it through |
| TeX to typeset it). Secondary articles, for example the works by Petr Sojka, |
| may also give some much needed insight into problems arising in automated |
| hyphenation.</p> |
| </section> |
| </body> |
| </document> |