<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE document PUBLIC "-//APACHE//DTD Documentation V1.0//EN" "../../dtd/document-v10.dtd">
<document>
<header>
<title>Offline Page Generation</title>
<version>1.0</version>
<type>Technical document</type>
<authors><person name="Upayavira" email="upayavira@apache.org"/>
</authors>
<abstract>This document explains the basic concepts of offline page generation with Apache Cocoon.</abstract>
</header>
<body>
<s1 title="Overview">
<p>Cocoon can generate static, 'offline' versions of web pages or web sites, as well
as serving sites dynamically. This document covers the concepts involved in offline
page and site generation.
</p>
</s1>
<s1 title="Offline Page Generation">
<p>Cocoon allows static versions of Cocoon web sites to be created.</p>
<p>At present, this can be done in three ways:</p>
<ul>
<li><link href="cli.html">Command Line Interface</link></li>
<li><link href="ant.html">Using Ant Task</link></li>
<li><link href="bean.html">Cocoon Bean</link></li>
</ul>
<p>This document explains the general concepts that are shared by all of these approaches.
The specific details for each method are explained on a separate page.</p>
<p>When generating pages offline, Cocoon can follow links in a page (whether that page
is HTML, PDF or anything else), and can rewrite URIs to create filenames by checking
the mime type of the generated page. Links to pages whose URIs change are updated
to match.
</p>
</s1>
<s1 title="Configuration">
<p>To use Cocoon in its offline mode, a servlet container (e.g. Tomcat or Jetty) is not
needed. Cocoon can generate an offline site directly using the information available
in the Cocoon <code>webapp</code> folder.</p>
<p>Having said this, many choose to have a servlet container available locally for use
whilst debugging, as this can speed up the development process significantly.</p>
<s2 title="Directories and Files">
<p>As all the information Cocoon needs to generate a site is stored in the Cocoon
webapp directory, Cocoon must be told where that directory is, and where to find various
other files and directories. These are:</p>
<ul>
<li>Context directory (the Cocoon Webapp directory)</li>
<li>Configuration File (usually <code>${COCOON_WEBAPP}/WEB-INF/cocoon.xconf</code>)</li>
<li>Work Directory (used by Cocoon to store temporary files; this can be anywhere you choose)</li>
</ul>
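<p>As an illustration, these three locations might be given as follows in the xconf file used by
the command line interface. This is a minimal sketch that assumes that file format: the paths are
hypothetical, and the element names should be verified against the
<link href="cli.html">Command Line Interface</link> page.</p>
<source>
<![CDATA[
<cocoon>
  <!-- Context directory: the Cocoon webapp directory (hypothetical path) -->
  <context-dir>build/webapp</context-dir>
  <!-- Configuration file; resolving it relative to the context directory is an assumption -->
  <config-file>WEB-INF/cocoon.xconf</config-file>
  <!-- Work directory: any location suitable for temporary files -->
  <work-dir>build/work</work-dir>
</cocoon>
]]>
</source>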
</s2>
<s2 title="Logging">
<p>There are three options that need to be specified in relation to logging. These are:</p>
<ul>
<li>Log Kit (the logging configuration file, usually <code>${COCOON_WEBAPP}/WEB-INF/logkit.xconf</code>)</li>
<li>Logger (a category used for logging, as configured in the configuration file)</li>
<li>Log Level (a logging level: one of DEBUG, INFO, WARN, ERROR or FATAL_ERROR. This applies only to logging
at startup, after which the log kit configuration takes over)</li>
</ul>
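<p>The three logging options can be pictured as a single element in an xconf file. The sketch
below assumes the xconf format of the command line interface; the attribute names and the
<code>cli</code> category are assumptions to be checked against the page for the method in use.</p>
<source>
<![CDATA[
<!-- log-kit: the logging configuration file
     logger:  the category used until the log kit configuration takes over
     level:   DEBUG, INFO, WARN, ERROR or FATAL_ERROR (startup logging only) -->
<logging log-kit="WEB-INF/logkit.xconf" logger="cli" level="ERROR"/>
]]>
</source>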
</s2>
<s2 title="Other Configuration Options">
<p>In online mode, a user agent string tells Cocoon what browser is being used to access a page. The user agent
can be configured manually for offline generation.</p>
<p>In online mode, an accept string is provided by a browser, telling the server what types of content the browser
is capable of accepting. This is a comma-separated list of mime types. In offline mode, an accept
string can also be specified.</p>
<p>As Cocoon-based sites can change the content they generate based upon the user agent string and the accept
string, it can be necessary to specify them in order to have the correct content generated.</p>
<p>In order to generate sites that make use of databases and database connections, it is necessary to load
JDBC classes at startup. Cocoon allows for this.</p>
<p>When, in offline mode, Cocoon generates a page whose URI ends in a <code>/</code>, the resultant file cannot be
written to a filesystem, as its name would refer to a directory. Therefore, the user can
specify a default filename, which is appended to the page's URI before the page is saved to disk.</p>
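<p>The options in this section could be expressed as follows. This is a sketch only, assuming
the xconf format of the command line interface: the element names should be verified against the
page for the method in use, and the user agent string and JDBC driver class are hypothetical
placeholders.</p>
<source>
<![CDATA[
<!-- User agent string presented to the sitemap (hypothetical value) -->
<user-agent>ExampleOfflineCrawler/1.0</user-agent>

<!-- Accept string: a comma-separated list of mime types -->
<accept>text/html, */*</accept>

<!-- JDBC driver class to load at startup (hypothetical class name) -->
<load-class>org.example.jdbc.Driver</load-class>

<!-- Appended to URIs that end in a slash -->
<default-filename>index.html</default-filename>
]]>
</source>
<p>With this default filename, a page requested as <code>docs/</code> would be saved as
<code>docs/index.html</code>.</p>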
</s2>
</s1>
<s1 title="URIs and Targets">
<s2 title="SourceURIs">
<p>A source URI (which may also have a source prefix prepended) is the part of the URI that is given
to Cocoon for processing. For example, if you access a page at
<code>http://localhost:8080/cocoon/site/page.html</code>, then the source URI is
<code>site/page.html</code>.</p>
</s2>
<s2 title="Destinations and Modifiable Sources">
<p>Most of the time, generated pages are simply written to disk.</p>
<p>However, this is not the only option. Generated pages can be written anywhere for which a
<code>ModifiableSource</code> exists. So, for example, it is possible to generate a site and
have the pages written directly to a web server using FTP, by making use of the Avalon
<code>FTPSource</code>.</p>
</s2>
<s2 title="Target Types">
<p>When generating a page, Cocoon needs to know how to decide upon the URI of the generated page.
This process could be described as 'URI arithmetic'.</p>
<p>Source and destination URIs are made up of the following elements:</p>
<ul>
<li>Source Prefix: Part of a source URI used to request a page but excluded from the destination
URI</li>
<li>Source URI: Part of a source URI that is used when calculating the destination URI</li>
<li>Destination URI: The base URI for a destination</li>
<li>Type: The method used for merging the above elements (can be append, replace or
insert)</li>
</ul>
<note>When combining elements to make a URI, it is the user's responsibility to include directory
separators. For example, <code>foo</code> with <code>bar</code> appended will be
<code>foobar</code>, whereas <code>foo/</code> with <code>bar</code> appended will be
<code>foo/bar</code>.
</note>
<s3 title="Appending">
<p>Here, when calculating the destination URI, the source prefix is ignored, and the destination
URI is calculated by appending the source URI to the end of the destination URI. For example,
with the following values:</p>
<p>Source prefix: <code>site/</code>, source URI: <code>page.html</code>, destination URI:
<code>pages/</code></p>
<p>A request will be made to Cocoon for a page at: <code>site/page.html</code>. This will be
saved as <code>pages/page.html</code>.</p>
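<p>Written as a target in an xconf file, the same values might look like this. The element and
attribute names assume the xconf format of the command line interface and should be checked
against that page.</p>
<source>
<![CDATA[
<!-- Requests site/page.html; saves the result as pages/page.html -->
<uri type="append" src-prefix="site/" src="page.html" dest="pages/"/>
]]>
</source>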
</s3>
<s3 title="Replacing">
<p>Here, when calculating the destination URI, the source prefix and the source URI are
ignored, and the destination URI is used as is. This is useful when you wish to save the
generated page with a filename that bears no relationship to the source URI. For example,
with the following values:</p>
<p>Source prefix: <code>site/</code>, source URI: <code>page.html</code>, destination URI:
<code>pages/simple.html</code></p>
<p>A request will be made to Cocoon for a page at: <code>site/page.html</code>. This will be
saved as <code>pages/simple.html</code>.</p>
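<p>As a sketch, assuming the same xconf element and attribute names as used by the command line
interface (to be verified against that page), these values might be written as:</p>
<source>
<![CDATA[
<!-- Requests site/page.html; saves the result as pages/simple.html,
     ignoring both the source prefix and the source URI -->
<uri type="replace" src-prefix="site/" src="page.html" dest="pages/simple.html"/>
]]>
</source>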
<note>Given the nature of this target type, it inherently cannot be used when following links
(otherwise all pages will be written on top of each other).</note>
</s3>
<s3 title="Inserting">
<p>Here, when calculating the destination URI, the source prefix is ignored, and the source URI
is inserted into the destination URI at the point marked by an asterisk (*). This is intended
for use with complex protocols where the source URI does not appear at the end of the
destination URI.</p>
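<p>To make the rule concrete, consider a source prefix of <code>site/</code>, a source URI of
<code>page.html</code> and a destination URI of <code>mirror/*.copy</code>. These values are
hypothetical and chosen only to show the arithmetic. A request is made to Cocoon for
<code>site/page.html</code>, and the result is saved as <code>mirror/page.html.copy</code>.
Written as an xconf target, with element and attribute names assumed from the command line
interface format:</p>
<source>
<![CDATA[
<!-- The source URI replaces the asterisk in the destination URI:
     mirror/*.copy  becomes  mirror/page.html.copy -->
<uri type="insert" src-prefix="site/" src="page.html" dest="mirror/*.copy"/>
]]>
</source>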
</s3>
</s2>
<s2 title="Mime Type Checking">
<p>Cocoon can optionally test the mime type for a page, and, if the mime type doesn't match the page's
extension, amend the destination URI to include the correct extension. This will ensure that pages
will load correctly when served by a static web server.</p>
<p>When Cocoon amends a destination URI, it also amends URIs for links in those pages, so that links
will still work when a site has been crawled.</p>
<note>This feature substantially slows down page generation, as each page must be generated three times
(once to find links, once to find its mime type and once to collect the actual content). This
can be avoided by ensuring that all URIs in the site are correct and do not need amending, in which
case it is only necessary to generate a page once.</note>
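<p>As a sketch of how this check might be switched on, assuming the xconf format of the command
line interface, it could appear as an attribute on the root element alongside the link following
setting. Both attribute names are assumptions to be verified against the page for the method in
use.</p>
<source>
<![CDATA[
<!-- confirm-extensions="true": check each page's mime type and amend its extension.
     Where every URI in the site already has the correct extension,
     setting this to "false" avoids the extra generations. -->
<cocoon follow-links="true" confirm-extensions="true">
  <!-- directories, logging and uri targets go here -->
</cocoon>
]]>
</source>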
</s2>
</s1>
<s1 title="Following Links and Site Crawling">
<p>Cocoon can be configured either to follow or to ignore links in the pages that it generates. It has two methods
of gathering links: 'link view' and 'link gathering'.</p>
<s2 title="Link View Crawling">
<p>With link view crawling, Cocoon gets the links by generating the 'link view' for a page. Using link view
gives a significant degree of configurability in terms of which links are gathered, as it is possible to
insert a transformer into the view to filter out links that should not be followed.</p>
<p>The disadvantage with link view crawling is that each page must be generated twice, which doubles page
generation time.</p>
<p>Link view is usually configured in the root sitemap with:</p>
<source>
<![CDATA[
<map:views>
<map:view from-position="last" name="links">
<map:serialize type="links"/>
</map:view>
</map:views>
]]>
</source>
<p>If you have this in your root sitemap, you do not need it in your sub-sitemaps. However, you may choose
to override it with one that carries out further processing, for example with an XSLT transformer that
removes links that should not be crawled.</p>
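<p>Such an override might look like the sketch below. The stylesheet path
<code>stylesheets/filterlinks.xsl</code> and the <code>private/</code> prefix are hypothetical;
the stylesheet is one possible way to filter links, not a file shipped with Cocoon.</p>
<source>
<![CDATA[
<!-- In the sub-sitemap: filter the content before the links serializer sees it -->
<map:views>
  <map:view from-position="last" name="links">
    <map:transform src="stylesheets/filterlinks.xsl"/>
    <map:serialize type="links"/>
  </map:view>
</map:views>

<!-- stylesheets/filterlinks.xsl (hypothetical) -->
<xsl:stylesheet version="1.0"
                xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <!-- Copy everything through unchanged by default -->
  <xsl:template match="@*|node()">
    <xsl:copy>
      <xsl:apply-templates select="@*|node()"/>
    </xsl:copy>
  </xsl:template>

  <!-- Drop href attributes that point into the private/ area,
       so that the crawler never sees those links -->
  <xsl:template match="@href[starts-with(., 'private/')]"/>

</xsl:stylesheet>
]]>
</source>
<p>Because the transformer sits inside the view, it affects only the list of links that is
gathered, not the page that is generated by the main pipeline.</p>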
<p>See <link href="../concepts/views.html">views</link> for more on views. </p>
<p>You can see the link view yourself by appending <code>?cocoon-view=links</code> to the page's URI.</p>
</s2>
<s2 title="Link Gathering Crawling">
<p>With link gathering crawling, links are gathered from the SAX stream right before the serializer. All
<code>src</code>, <code>href</code> and <code>xlink:href</code> attributes are taken to be links, and are
therefore followed.</p>
<p>The benefit of link gathering crawling is that pages do not need to be generated twice. However, one loses
the ability to configure which links should be followed that exists with link view crawling.</p>
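<p>To illustrate the rule, consider the following fragment as it might arrive at the serializer.
The content is hypothetical, including the <code>ref</code> element; the comments show which
attributes would be treated as links according to the description above.</p>
<source>
<![CDATA[
<html xmlns:xlink="http://www.w3.org/1999/xlink">
  <body>
    <!-- href: gathered and followed -->
    <a href="about.html">About</a>

    <!-- src: gathered and followed, so the image is generated too -->
    <img src="images/logo.png" alt="Logo"/>

    <!-- xlink:href: gathered and followed -->
    <ref xlink:href="reference/index.html"/>

    <!-- title is not one of the three attributes, so it is not treated as a link -->
    <p title="contact.html">Not a link</p>
  </body>
</html>
]]>
</source>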
</s2>
</s1>
<s1 title="Broken Links">
<p>When a page cannot be found at a URI that has either been specified, or has been found as a link in another
page, it is considered 'broken'.</p>
<p>Exactly what is done when a broken link is found depends upon the method used to invoke
Cocoon. See related pages for specific details.</p>
<s2 title="Broken Link Handling using xconf Configuration method">
<p>The xconf method allows for more sophisticated broken link handling. The
user can choose to have broken links reported to a file, in either plain text
or XML.</p>
<p>When this file is plain text, it will have one link URI per line.</p>
<p>When this file is in XML, it will detail a message explaining the reason
for the broken link, as well as the URI of the link.</p>
<p>It is also possible to specify whether an error page should be generated
in the place of the broken page (based upon the configured
<code>&lt;map:handle-errors&gt;</code> code in the sitemap). If required,
an extension can be appended to the original file's URI to signify that
it is an error page (e.g. <code>.error</code>).</p>
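<p>These options might be combined in a single xconf element, as in the sketch below. The
element and attribute names are assumptions based on the xconf format of the command line
interface and should be verified against that page; the report filename is hypothetical.</p>
<source>
<![CDATA[
<!-- type:      "xml" or "text" report format
     file:      where the report is written (hypothetical filename)
     generate:  whether an error page is generated in place of the broken page
     extension: appended to the URI of a generated error page -->
<broken-links type="xml"
              file="brokenlinks.xml"
              generate="true"
              extension=".error"/>
]]>
</source>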
</s2>
</s1>
<s1 title="Precompiling XSPs">
<p>When used offline, Cocoon can precompile XSP pages. If no URIs are specified, it will scan all directories
within the context directory looking for XSP files, each of which will be compiled. If URIs are specified,
all links will be followed looking for pages that make use of XSP, compiling those XSP pages as they are
found.</p>
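<p>As a sketch, assuming the xconf format of the command line interface, precompilation might be
requested with an attribute on the root element. The attribute name is an assumption to be
verified against the page for the method in use.</p>
<source>
<![CDATA[
<!-- With no uri targets listed, every XSP file found under the
     context directory would be compiled -->
<cocoon precompile-only="true">
  <!-- directories and logging go here -->
</cocoon>
]]>
</source>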
</s1>
</body>
</document>