<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "http://www.w3.org/TR/html4/loose.dtd">
<html>
<head>
<META http-equiv="Content-Type" content="text/html; charset=UTF-8">
<title>Entity resolution with catalogs</title>
<link href="http://purl.org/DC/elements/1.0/" rel="schema.DC">
<meta content="Resolve external entities to local or other resources" name="DC.Subject">
<meta content="David Crossley" name="DC.Creator">
</head>
<body>
<h1>Introduction</h1>
<p>
Apache Cocoon has the capability to utilise an entity resolution mechanism.
External entities (e.g. Document Type Definitions (DTDs), character entity
sets, XML sub-documents) are resources that are declared by an XML instance
document - they exist as separate objects. An entity catalog assists with
entity management and the resolution of entities to accessible resources.
It also reduces the necessity for expensive and failure-prone network
retrieval of the required resources.
</p>
<h1>Overview</h1>
<p>
"Entities" represent the physical structure of an XML instance document,
whereas "elements" represent the logical structure. The complete entity
structure of the document defines which pieces need to be incorporated, so
as to build the final document. Those entities are objects from some
accessible place, e.g. local file system, local network, remote network,
generated from a database. Example entities are: DTDs, XML sub-documents,
sets of character entities to represent symbols and other glyphs, image
files.
</p>
<p>
So how are you going to define the accessible location of all those pieces?
How will you ensure that those resources are reliably available? Entity
resolution catalogs to the rescue. These are simple, standards-based
plain-text files that map public identifiers and system identifiers to local
or other resources.
</p>
<p>
Do you wonder why we cannot use the sitemap to resolve these resources?
This is because the resolution of all entities that compose the XML
document is under the direct control of the guts of the parser and the XML
structure. The parser has no choice - it must incorporate all of the defined
pieces. If it cannot retrieve them, then processing fails and it reports an
error.
</p>
<p>
With catalog support there are no such problems. This document
provides the following sections to explain Cocoon's capability for
resolving entities:
</p>
<ul>
<li>
<a href="#background">Background</a>
- explains the need, explains some terminology, describes the solution
</li>
<li>
<a href="#demo1">Demonstration #1</a>
- explains a remote resource and how it gets resolved
</li>
<li>
<a href="#cat">Catalogs overview</a>
- briefly explains how catalogs resolve entity declarations
</li>
<li>
<a href="#demo2">Demonstration #2</a>
- explains more detailed need and use of catalogs
and shows catalogs in action
</li>
<li>
<a href="#default">Default configuration</a>
- explains the default automated configuration
</li>
<li>
<a href="#config">Local configuration</a>
- explains how to extend the default configuration for your local
system requirements and provides an example
</li>
<li>
<a href="#imp">Implementation notes</a>
- describes how support for catalogs is added to Cocoon
</li>
<li>
<a href="#debug">Debugging the resolver configuration</a>
- explains how to check that identifiers are being resolved via the catalogs
</li>
<li>
<a href="#dev">Development notes</a>
- some minor issues need to be addressed
</li>
<li>
<a href="#notes">Other notes</a>
- assorted dot-points
</li>
<li>
<a href="#summ">Summary</a>
</li>
<li>
<a href="#info">Further information</a>
- links to some useful resources
</li>
</ul>
<a name="background"></a>
<h1>Background</h1>
<p>
The article
"<a class="external" href="http://xml.apache.org/commons/components/resolver/">XML Entity
and URI Resolvers</a>" by Norman Walsh eloquently describes the need for all
parsers and XML frameworks to be capable of utilising entity resolvers.
Please read that document, then return here to apply entity catalogs to Cocoon.
</p>
<p>
(Note: The <a class="external" href="http://xml.apache.org/commons/">Apache XML
Commons</a> project provides the Java package that has been added to
Cocoon as the <span class="codefrag">lib/core/xml-commons-resolver.jar</span> package.
There are also
API javadocs for <span class="codefrag">resolver</span> that have further information.
However, you do not need to know the gory details to understand catalogs
and configure them.)
</p>
<a name="demo1"></a>
<h1>Demonstration #1</h1>
<p>
This snippet from an XML instance shows the Document Type Declaration.
Notice that it declares its ruleset, the Document Type Definition (DTD),
as an external entity. Notice also that the resource is network-based.
</p>
<pre class="code">
&lt;?xml version="1.0"?&gt;
&lt;!DOCTYPE article PUBLIC "-//OASIS//DTD Simplified DocBook XML V4.1.2.5//EN"
"http://www.oasis-open.org/docbook/xml/simple/4.1.2.5/sdocbook.dtd"&gt;
&lt;article&gt;
... content goes here
&lt;/article&gt;
</pre>
<p>
Now consider what will happen when Cocoon tries to process this XML
instance. Whether you have set validation=yes or not, the parser will
still want to resolve all of the entities that are required by the XML
instance (i.e. the DTD and any other entities that the DTD might declare).
So it will happily trundle across the network to get them. It will do this
every time that the document is processed. This is obviously a needless
overhead. Worse still, what happens if that host is down or the network is
congested? Additionally, if your Cocoon is an off-line server, then processing
always fails because it cannot retrieve the network-based resources.
</p>
<a name="cat"></a>
<h1>Catalogs overview</h1>
<p>
As the Walsh document explained, the secrets to entity resolution are the
public identifiers, system identifiers, and the catalog to map between them.
Here we provide an overview and show an example catalog which we will then
use with the <a href="#demo2">Demonstration #2</a> below.
</p>
<h2>External entity declarations</h2>
<p>
To define an external entity in an XML instance document, you must
provide an external declaration consisting of at least a
<strong>system identifier</strong> and optionally a
<strong>public identifier</strong>. The system identifier defines the
physical location of the external entity. The public identifier is a
unique symbolic name that can be used to map to a certain physical location.
Note that if you provide both a public and a system identifier, then the
public identifier is listed first and the system identifier is not
preceded by the keyword <span class="codefrag">SYSTEM</span>.
Here are four separate examples ...
</p>
<pre class="code">
&lt;!ENTITY pic SYSTEM "images/pic.gif" NDATA gif&gt;
&lt;!ENTITY % ISOnum PUBLIC
"ISO 8879:1986//ENTITIES Numeric and Special Graphic//EN//XML" "ISOnum.pen"&gt;
&lt;!DOCTYPE document SYSTEM "dtd/document-v10.dtd"&gt;
&lt;!DOCTYPE book PUBLIC "-//OASIS//DTD DocBook XML V4.1//EN"
"http://www.oasis-open.org/docbook/xml/4.1/docbookx.dtd"&gt;
</pre>
<p>
(In your XML instance document, or DTD, you would then include a declared
parameter entity like this ... <span class="codefrag">%ISOnum;</span>)
</p>
<p>
None of those system identifiers looks reliable or easily managed.
Use a catalog to make them so.
</p>
<h2>Simple example catalog</h2>
<p>
The <span class="codefrag">catalog</span> maps public identifiers to their corresponding
physical locations. An OASIS catalog uses a simple whitespace-delimited
format for its entries.
(The <a href="#info">specification</a> fully defines the format.)
There are about a dozen different types of catalog entry - two important
ones are:
</p>
<ul>
<li>
<strong>PUBLIC</strong> <span class="codefrag">publicId systemId</span>
<br>- maps the public identifier <span class="codefrag">publicId</span> to the system
identifier <span class="codefrag">systemId</span>
</li>
<li>
<strong>SYSTEM</strong> <span class="codefrag">systemId otherSystemId</span>
<br>- maps the system identifier <span class="codefrag">systemId</span> to the alternate
system identifier <span class="codefrag">otherSystemId</span>
</li>
</ul>
<pre class="code">
-- this is the default OASIS catalog for Apache Cocoon --
OVERRIDE YES
-- ISO public identifiers for sets of character entities --
PUBLIC "ISO 8879-1986//ENTITIES Added Latin 1//EN//XML"
"ISOlat1.pen"
PUBLIC "ISO 8879:1986//ENTITIES Added Latin 1//EN//XML"
"ISOlat1.pen"
PUBLIC "ISO 9573-15:1993//ENTITIES Greek Letters//EN//XML"
"ISOgrk1.pen"
PUBLIC "ISO 8879:1986//ENTITIES Publishing//EN//XML"
"ISOpub.pen"
PUBLIC "ISO 8879:1986//ENTITIES General Technical//EN//XML"
"ISOtech.pen"
PUBLIC "ISO 8879:1986//ENTITIES Numeric and Special Graphic//EN//XML"
"ISOnum.pen"
-- Document Type Definitions --
PUBLIC "-//APACHE//DTD Documentation V1.0//EN"
"document-v10.dtd"
PUBLIC "-//APACHE//DTD FAQ V1.0//EN"
"faq-v10.dtd"
-- (other declarations removed for brevity) --
-- these entries are used for the catalog-demo sample application --
OVERRIDE NO
PUBLIC "-//Arbortext//TEXT Test Override//EN"
"catalog-demo/override.txt"
OVERRIDE YES
PUBLIC "-//Arbortext//TEXT Test Public Identifier//EN"
"catalog-demo/testpub.txt"
SYSTEM "urn:x-arbortext:test-system-identifier"
"catalog-demo/testsys.txt"
PUBLIC "-//Indexgeo//DTD Catalog Demo v1.0//EN"
"catalog-demo/catalog-demo-v10.dtd"
-- end of entries for the catalog-demo sample application --
</pre>
<p>
System identifiers can use full pathnames, filenames, relative pathnames,
or URLs - in fact, any method that will define and deliver the actual
physical entity. If it is just a filename or a relative pathname, then the
Catalog Resolver will look for the resource relative to the location of
the catalog.
</p>
<p>
When the parser needs to load a declared entity, then it first consults
the Catalog Resolver to get a possible mapping to an alternate system
identifier. If there is no mapping for an identifier in the catalogs
(or in any subordinate catalogs), then Cocoon will carry on to
retrieve the resource using the original declared system identifier.
</p>
<a name="demo2"></a>
<h1>Demonstration #2</h1>
<p>
See catalogs in action with the
<a href="../../overview.html#samples">Cocoon Samples</a>.
The demonstration
is intended to be self-documenting. The top-level XML instance describes its
role, and each included external entity reports how it came into being.
This example builds upon the example provided by the Walsh article.
(Tip: To see the error message that would result from not using a catalog,
simply rename the default <span class="codefrag">catalog</span> file before starting
Cocoon.)
</p>
<p>Here is the source for the top-level XML instance document
<span class="codefrag">catalog-demo.xml</span> ...
</p>
<pre class="code">
&lt;?xml version="1.0"?&gt;
&lt;!DOCTYPE catalog-demo PUBLIC "-//Indexgeo//DTD Catalog Demo v1.0//EN"
"http://www.indexgeo.com.au/dtd/catalog-demo-v10.dtd"
[
&lt;!ENTITY testpub PUBLIC "-//Arbortext//TEXT Test Public Identifier//EN"
"bogus-system-identifier.xml"&gt;
&lt;!ENTITY testsys SYSTEM "urn:x-arbortext:test-system-identifier"&gt;
&lt;!ENTITY testovr PUBLIC "-//Arbortext//TEXT Test Override//EN"
"testovr.txt"&gt;
&lt;!ENTITY % ISOnum PUBLIC
"ISO 8879:1986//ENTITIES Numeric and Special Graphic//EN//XML"
"ISOnum.pen"&gt;
%ISOnum;
&lt;!ENTITY note "Note:"&gt;
]&gt;
&lt;catalog-demo&gt;
&lt;section&gt;
&lt;para&gt;This sample application demonstrates the use of catalogs for
entity resolution. &amp;note; see the Apache Cocoon documentation
&lt;link href="../../docs/userdocs/concepts/catalog.html"&gt;Entity resolution
with catalogs&lt;/link&gt; for the full background and explanation, and the XML
source of this document (catalog-demo.xml).
&lt;/para&gt;
&lt;para&gt;This top-level XML instance document is catalog-demo.xml - it declares
three other XML sub-documents as external entities and then includes
them in the sections below. The real system identifiers will be looked
up in the catalog, to resolve the actual location of the resource.
&lt;/para&gt;
&lt;para&gt;The Document Type Definition (DTD) is declared using both a public
identifier and a system identifier. The system identifier for the DTD is
a network-based resource (which is deliberately non-existent). However,
the catalog overrides that remote DTD to instead use a copy from the
local filesystem at the location defined by the catalog entry. Note that
it is via the use of a public identifier that we gain this power.
&lt;/para&gt;
&lt;para&gt;The internal DTD subset of the top-level document instance goes on
to declare the three external sub-document entities using various means.
It also declares and includes the ISOnum set of character entities,
so that we can use entities like "&amp;amp;frac12;" (to represent &amp;frac12;).
Finally the internal DTD subset declares an internal general entity
for &amp;quot;&amp;amp;note;&amp;quot;.
&lt;/para&gt;
&lt;/section&gt;
&lt;section&gt;
&lt;para&gt;testpub ... this entity is declared with a PUBLIC identifier and a
bogus system identifier (which will be overridden by the catalog)
&lt;/para&gt;
&lt;para&gt;&amp;note; &amp;testpub;&lt;/para&gt;
&lt;/section&gt;
&lt;section&gt;
&lt;para&gt;testsys ... this entity is declared with a SYSTEM identifier
(which will be resolved by the catalog)
&lt;/para&gt;
&lt;para&gt;&amp;note; &amp;testsys;&lt;/para&gt;
&lt;/section&gt;
&lt;section&gt;
&lt;para&gt;testovr ... is declared with a PUBLIC identifier and a system
identifier (the catalog is set to not override this one, so the
declared system identifier is used)
&lt;/para&gt;
&lt;para&gt;&amp;note; &amp;testovr;&lt;/para&gt;
&lt;/section&gt;
&lt;/catalog-demo&gt;
</pre>
<p>
Here is the source for one of the included sub-document external entities
<span class="codefrag">testpub.txt</span> (a slab of plain text) ...
</p>
<pre class="code">
This paragraph is automatically included from the
testpub.txt external file.
The entity declaration deliberately used a non-existent file
as the system identifier. The catalog then used the declared
public identifier to resolve to a specific location on the local
filesystem.
</pre>
<a name="default"></a>
<h1>Default configuration</h1>
<p>
A default catalog and some base entities (e.g. ISO*.pen character
entity sets) are included in the Cocoon distribution at
<span class="codefrag">WEB-INF/entities/</span>
- the default catalog is automatically loaded when Cocoon starts.
</p>
<p>
If you suspect problems, then you can raise the level of the
<span class="codefrag">verbosity</span> property (to 2 or 3) and watch the messages going
to standard output when Cocoon starts and operates. You would also do
this to detect any misconfiguration of your own catalogs.
</p>
<a name="config"></a>
<h1>Local configuration</h1>
<p>You can extend the default configuration to include local catalogs
for site-specific requirements. This is achieved via various means.
</p>
<h2>Using cocoon.xconf</h2>
<p>Parameters (properties) for the resolver component can be specified in the
<span class="codefrag">src/webapp/WEB-INF/cocoon.xconf</span>
configuration file. See the detailed internal notes - here is a precis.
</p>
<ul>
<li>
<span class="codefrag">catalog</span>
... The main catalog file. Its path name is relative to the
Cocoon context directory.
</li>
<li>
<span class="codefrag">local-catalog</span>
... The full filesystem pathname to a single local catalog file.
</li>
<li>
<span class="codefrag">verbosity</span>
... The level of messages from the resolver
(loading catalogs, identifier resolution, etc.).
Its value may range from 0 (no messages) to 10 (detailed log messages).
</li>
</ul>
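<p>
A minimal sketch of such a configuration follows. The parameter names are
those listed above. The enclosing element name
(<span class="codefrag">entity-resolver</span>) is an assumption and may
differ between Cocoon versions, so check the
<span class="codefrag">cocoon.xconf</span> that ships with your distribution.
The catalog filename and the local pathname are illustrative only.
</p>
<pre class="code">
&lt;!-- element name assumed; the three parameters are described above --&gt;
&lt;entity-resolver&gt;
  &lt;!-- main catalog, relative to the Cocoon context directory --&gt;
  &lt;parameter name="catalog" value="WEB-INF/entities/catalog"/&gt;
  &lt;!-- one additional local catalog, given as a full filesystem pathname --&gt;
  &lt;parameter name="local-catalog" value="/usr/local/sgml/local.cat"/&gt;
  &lt;!-- raise to 2 or 3 while investigating problems --&gt;
  &lt;parameter name="verbosity" value="1"/&gt;
&lt;/entity-resolver&gt;
</pre>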
<h2>Using CatalogManager.properties</h2>
<p>An annotated <span class="codefrag">CatalogManager.properties</span> file is included
with the distribution - modify it to suit your needs. You can add your
own local catalogs using the <span class="codefrag">catalogs</span> property.
(See the notes inside the properties file).
</p>
<p>
The file is at
<span class="codefrag">webapp/WEB-INF/classes/CatalogManager.properties</span>
thereby making it available to the Java classpath during startup of the
servlet engine.
</p>
<p>
If you see an error message going to STDOUT when Cocoon starts
(<span class="codefrag">Cannot find CatalogManager.properties</span>) then this means that
the properties file is not available to the Java classpath. Note that this
does not mean that entity resolution is disabled, rather that no local
configuration is being effected. Therefore no local catalogs will be
loaded and no entity resolution messages will be received (verbosity
level is zero by default).
</p>
<p>
That may truly be the intention, and not just a configuration mistake.
You can still use <span class="codefrag">cocoon.xconf</span> to effect your local
configuration.
</p>
<h2>Resolver directives inside your catalog file</h2>
<p>
The actual "catalog" files have a powerful set of directives.
For example, the <strong>CATALOG</strong> directive facilitates the
inclusion of a subordinate catalog. The list of resources below will
lead to <a href="#info">further information</a> about catalog usage.
</p>
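<p>
For instance, a main catalog might pull in a second catalog like this.
This is a sketch only: the pathname is illustrative and should point at
whatever subordinate catalog you maintain.
</p>
<pre class="code">
-- entries held directly in this catalog --
OVERRIDE YES
PUBLIC "-//APACHE//DTD Documentation V1.0//EN"
       "document-v10.dtd"
-- also consult the entries of a subordinate catalog --
CATALOG "/usr/local/sgml/docbook/simple/sdocbook.cat"
</pre>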
<h2>Example local configuration for Simplified DocBook</h2>
<p>
We use the Simplified DocBook XML DTD for some of our documentation.
Here are the few steps that we followed to configure Cocoon to be able
to process our XML instances.
</p>
<ul>
<li>
Downloaded a recent copy of the Simplified DocBook DTD (the flattened DTD
will suffice) from
<a class="external" href="http://www.oasis-open.org/docbook/">here</a>
and placed it at
<span class="codefrag">/usr/local/sgml/docbook/simple/sdocbook.dtd</span>
</li>
<li>
Created a catalog file at
<span class="codefrag">/usr/local/sgml/docbook/simple/sdocbook.cat</span>
with a single entry for the Simplified DocBook XML DTD
</li>
<li>
Added the parameter (<span class="codefrag">local-catalog</span>) to the
<span class="codefrag">WEB-INF/cocoon.xconf</span>
(using the full pathname to the <span class="codefrag">sdocbook.cat</span> catalog).
</li>
</ul>
<pre class="code">
-- Catalog file (sdocbook.cat) for Simplified DocBook --
-- See www.oasis-open.org/docbook/ --
-- Driver file for the Simplified DocBook XML DTD --
PUBLIC "-//OASIS//DTD Simplified DocBook XML V4.1.2.5//EN"
"sdocbook.dtd"
-- end of catalog file for Simplified DocBook --
</pre>
<p>
We could similarly configure Cocoon for the full DocBook XML DTD and
related entities. In fact, the DocBook distribution already contains a
catalog file. We need only append its pathname to our <span class="codefrag">catalogs</span>
property.
</p>
<p>
There are a few important starting points for
<a href="#info">further information</a> about using and configuring
the DocBook DTDs.
</p>
<a name="imp"></a>
<h1>Implementation notes</h1>
<p>
The SAX <span class="codefrag">Parser</span> interface provides an <span class="codefrag">entityResolver</span>
hook to allow an application to resolve the external entities. This is
enabled via
<span class="codefrag">org.apache.excalibur.xml.EntityResolver</span>.
</p>
<p>
The
<a class="external" href="http://xml.apache.org/commons/">Apache XML Commons</a>
project has <span class="codefrag">org.apache.xml.resolver</span>
which provides a <strong>CatalogResolver</strong>. This is incorporated
into Cocoon via <span class="codefrag">org.apache.excalibur.xml.EntityResolver</span>.
</p>
<p>
<a href="#default">Default configuration</a> is achieved via
<span class="codefrag">org.apache.cocoon.components.resolver.DefaultResolver.java</span>
which initialises the catalog resolver and loads a default system catalog.
The <span class="codefrag">DefaultResolver.java</span> enables <a href="#config">local
configuration</a> by applying properties from the
<span class="codefrag">CatalogManager.properties</span> file and then further configuration
from <span class="codefrag">WEB-INF/cocoon.xconf</span> parameters.
</p>
<a name="debug"></a>
<h1>Debugging the resolver configuration</h1>
<p>
Raise the <span class="codefrag">verbosity</span> level (see <a href="#config">Local configuration</a>) and watch the
messages that go to standard output.
</p>
<p>
The "Resolved public" messages should show that the Public Identifiers
are being used to find the local copies of the resources. If a "Resolved
public" message does not occur for a particular resource, then Cocoon
will be retrieving it from the specified location. Use a packet watching
tool like "ngrep" to see the HTTP request for the resource.
</p>
<a name="dev"></a>
<h1>Development notes</h1>
<ul>
<li>Keep up-to-date with releases of
<a class="external" href="http://www.oasis-open.org/docbook/xmlcharent/">XML Character
Entities</a> by OASIS.
</li>
</ul>
<a name="notes"></a>
<h1>Other notes</h1>
<ul>
<li>OASIS Catalogs (TR 9401:1995 Entity Management) are plain-text files
with a simple delimited format. There is also a new standard being
developed for XML Catalogs, using an xml-based structured plain-text file
(gee :-). Links to both standards are provided below. Both catalog formats
can currently be used with this entity resolver. However, the latter
standard is not yet settled. OASIS TR9401 catalogs will suffice.
</li>
<li>There has been a recent flood of XML tools - unfortunately, many do not
implement entity resolution (other than by brute-force retrieval), so
those tools are crippled and cannot be used for serious XML processing.
Please ensure that you choose
<a class="external" href="http://www.oasis-open.org/cover/">proper XML tools</a>
for the preparation and validation of your XML instance documents.
</li>
<li>The default catalog that is shipped with the Cocoon distribution is
deliberately basic. You will need to supplement it with your own catalog
devised to suit your particular needs.
</li>
</ul>
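<p>
For comparison, here is roughly how two of the TR 9401 entries used in this
document would be written in the XML Catalogs format. As the first note
above says, that format is not yet settled, so treat this as a sketch and
check the element names and namespace against the specification.
</p>
<pre class="code">
&lt;?xml version="1.0"?&gt;
&lt;!-- prefer="public" plays the role of OVERRIDE YES --&gt;
&lt;catalog xmlns="urn:oasis:names:tc:entity:xmlns:xml:catalog" prefer="public"&gt;
  &lt;public publicId="-//OASIS//DTD Simplified DocBook XML V4.1.2.5//EN"
          uri="sdocbook.dtd"/&gt;
  &lt;system systemId="urn:x-arbortext:test-system-identifier"
          uri="catalog-demo/testsys.txt"/&gt;
&lt;/catalog&gt;
</pre>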
<a name="summ"></a>
<h1>Summary</h1>
<p>
Most XML documents that we would want to serve with Cocoon already exist
in another information system. The XML document instances declare their
Document Type Definition (DTD) as an external file.
This external DTD also includes entity sets such as ISOnum, ISOlat1, etc.
The DTD declaration also has a Formal Public Identifier and a System
Identifier which points to a remote URL. These XML instance documents cannot
be altered to use workarounds such as a relative system identifier like
<span class="codefrag">../dtd/document-1.0.dtd</span>.
</p>
<p>
Entity management is effected by providing a standards-based mechanism to
resolve public identifiers and system identifiers to local filenames or
other identifiers or even to other remote network resources. So references
to external DTDs, sets of character entities such as mathematical symbols,
fragments of XML documents, complete sub-documents, non-xml data chunks
(like images), etc. can all be centrally managed and resolved locally.
</p>
<a name="info"></a>
<h1>Further information</h1>
<p>
Here are some links to documents which extol entity management:
</p>
<ul>
<li>
<a class="external" href="http://www.oasis-open.org/committees/entity/">OASIS Entity
Resolution Technical Committee</a> - see especially the
<a class="external" href="http://www.oasis-open.org/specs/a401.html">specification for
OASIS Catalogs</a> (TR 9401:1995 Entity Management)
and the
<a class="external" href="http://www.oasis-open.org/committees/entity/spec.html">specification for XML Catalogs</a>
</li>
<li>
<a class="external" href="http://www.oasis-open.org/cover/topics.html#entities">SGML/XML Special Topics: Entity Sets and Entity Management</a>
at the
<a class="external" href="http://www.oasis-open.org/cover/">XML Cover Pages</a>
</li>
<li>
<a class="external" href="http://www.oasis-open.org/cover/topics.html#fpi-fsi">SGML/XML
Special Topics: Catalogs, Formal Public Identifiers, Formal System
Identifiers</a>
at the
<a class="external" href="http://www.oasis-open.org/cover/">XML Cover Pages</a>
</li>
<li>Arbortext column by Norman Walsh
<a class="external" href="http://www.arbortext.com/html/presentations___articles.html">Standard
Deviations from Norm</a>
<br> - Issue Three:
<a class="external" href="http://www.arbortext.com/html/issue_three.html">If You Can Name It, You Can Claim It!</a>
</li>
<li>
The
<a class="external" href="http://xml.apache.org/commons/">Apache XML Commons</a>
project provides the
entity resolver Java classes (which are used in Cocoon) and evolution of
the Arbortext article into
<a class="external" href="http://xml.apache.org/commons/components/resolver/">XML
Entity and URI Resolvers</a>.
</li>
<li>XML-Deviant article 2000-11-29
<a class="external" href="http://www.xml.com/pub/a/2000/11/29/deviant.html">What's in
a Name?</a>
</li>
<li>
<a class="external" href="http://www.oasis-open.org/docbook/">DocBook</a>:
<a class="external" href="http://www.docbook.org/">The Definitive Guide</a>
- Section 2.3 Public Identifiers, System Identifiers, and Catalog Files
</li>
<li>
FAQ <a class="external" href="http://www.dpawson.co.uk/docbook/catalogs.html">Catalogs
and Docbook</a>
</li>
<li>OASIS is the <a class="external" href="http://www.oasis-open.org/docbook/">official
home</a> of the DocBook DTDs
(see also <a class="external" href="http://docbook.sourceforge.net/">DocBook Open
Repository</a> project at SourceForge)
</li>
<li>Organization for the Advancement of Structured Information Standards
(<a class="external" href="http://www.oasis-open.org/">OASIS</a>)</li>
</ul>
</body>
</html>