| <?xml version="1.0" encoding="UTF-8"?> |
| <!DOCTYPE document PUBLIC "-//APACHE//DTD Documentation V1.0//EN" "../../dtd/document-v10.dtd"> |
| |
| <document> |
| <header> |
| <title>Offline Page Generation</title> |
| <version>1.0</version> |
| <type>Technical document</type> |
| <authors><person name="Upayavira" email="upayavira@apache.org"/> |
| </authors> |
| <abstract>This document explains the basic concepts of offline page generation with Apache Cocoon.</abstract> |
| </header> |
| <body> |
| <s1 title="Overview"> |
| <p>As well as serving sites dynamically, Cocoon can generate static, 'offline' versions |
| of web pages or entire web sites. This document covers the concepts involved in offline |
| page and site generation. |
| </p> |
| </s1> |
| <s1 title="Offline Page Generation"> |
| <p>Cocoon allows static versions of Cocoon web sites to be created.</p> |
| <p>At present, this can be done in three ways:</p> |
| <ul> |
| <li><link href="cli.html">Command Line Interface</link></li> |
| <li><link href="ant.html">Using Ant Task</link></li> |
| <li><link href="bean.html">Cocoon Bean</link></li> |
| </ul> |
| <p>This document explains the general concepts that are shared by all of these approaches. |
| The specific details for each method are explained on a separate page.</p> |
| <p>When generating pages offline, Cocoon can follow links in a page (whether that page |
| is HTML, PDF or anything else), and can rewrite URIs to create filenames by checking |
| the mime type of the generated page. Links that point to pages whose URIs change are |
| updated to match. |
| </p> |
| </s1> |
| <s1 title="Configuration"> |
| <p>To use Cocoon in its offline mode, a servlet container (e.g. Tomcat or Jetty) is not |
| needed. Cocoon can generate an offline site directly using the information available |
| in the Cocoon <code>webapp</code> folder.</p> |
| <p>Having said this, many choose to have a servlet container available locally for use |
| whilst debugging, as this can speed up the development process significantly.</p> |
| <s2 title="Directories and Files"> |
| <p>All the information Cocoon needs to generate a site is stored in the Cocoon |
| webapp directory, so Cocoon must be told where that directory is, and where to find |
| various other files and directories. These are:</p> |
| <ul> |
| <li>Context directory (the Cocoon Webapp directory)</li> |
| <li>Configuration File (usually <code>${COCOON_WEBAPP}/WEB-INF/cocoon.xconf</code>)</li> |
| <li>Work Directory (used by Cocoon to store temporary files; this can be any location you choose)</li> |
| </ul> |
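| <p>As a rough sketch, assuming the element names used by the command line interface's |
| xconf file (check the page for your chosen method for the exact syntax), these three |
| settings might be declared as follows. All paths are placeholders.</p> |
| <source> |
| <![CDATA[ |
| <cocoon> |
|   <!-- Context directory: the Cocoon webapp directory (placeholder path) --> |
|   <context-dir>build/webapp</context-dir> |
|   <!-- Configuration file (placeholder path) --> |
|   <config-file>WEB-INF/cocoon.xconf</config-file> |
|   <!-- Work directory: temporary files, any writable location --> |
|   <work-dir>build/work</work-dir> |
| </cocoon> |
| ]]> |
| </source> |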
| </s2> |
| <s2 title="Logging"> |
| <p>There are three options that need to be specified in relation to logging. These are:</p> |
| <ul> |
| <li>Log Kit (the logging configuration file, usually <code>${COCOON_WEBAPP}/WEB-INF/logkit.xconf</code>)</li> |
| <li>Logger (a category used for logging, as configured in the configuration file)</li> |
| <li>Log Level (a logging level, one of DEBUG, INFO, WARN, ERROR or FATAL_ERROR. This applies only to logging |
| at startup, after which the log kit configuration takes over)</li> |
| </ul> |
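| <p>A minimal sketch of these three options, assuming the attribute names used by the |
| command line interface's xconf file. The logger category <code>cli</code> is only an |
| example and must match a category defined in your log kit configuration.</p> |
| <source> |
| <![CDATA[ |
| <!-- log-kit: logging configuration file; logger: category; level: startup log level --> |
| <logging log-kit="WEB-INF/logkit.xconf" logger="cli" level="ERROR"/> |
| ]]> |
| </source> |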
| </s2> |
| <s2 title="Other Configuration Options"> |
| <p>In online mode, a user agent string tells Cocoon which browser is being used to access a page. The user agent |
| can be configured manually for offline generation.</p> |
| <p>In online mode, the browser provides an accept string, telling the server what types of content the browser |
| is capable of accepting. This is a comma-separated list of mime types. In offline mode, an accept |
| string can also be specified.</p> |
| <p>As Cocoon-based sites can change the content they generate based upon the user agent string and the accept |
| string, it can be necessary to specify both in order to have the correct content generated.</p> |
| <p>In order to generate sites that make use of databases and database connections, it is necessary to load |
| JDBC classes at startup. Cocoon allows for this.</p> |
| <p>When, in offline mode, Cocoon generates a page whose URI ends in a <code>/</code>, the resultant file cannot be |
| written to a filesystem, as its name would refer to a directory. Therefore, the user can |
| specify a default filename which will be appended to the page's URI before saving to disk.</p> |
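| <p>The options in this section might be expressed as follows. This is a sketch that assumes |
| the element names used by the command line interface's xconf file. The user agent value and |
| the JDBC driver class name are hypothetical placeholders.</p> |
| <source> |
| <![CDATA[ |
| <!-- User agent string presented to the sitemap (hypothetical value) --> |
| <user-agent>ExampleOfflineAgent/1.0</user-agent> |
| <!-- Accept string: comma-separated list of mime types --> |
| <accept>text/html, */*</accept> |
| <!-- Class to load at startup, e.g. a JDBC driver (hypothetical class name) --> |
| <load-class>org.example.jdbc.Driver</load-class> |
| <!-- Filename appended to URIs that end in a forward slash --> |
| <default-filename>index.html</default-filename> |
| ]]> |
| </source> |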
| </s2> |
| </s1> |
| <s1 title="URIs and Targets"> |
| <s2 title="SourceURIs"> |
| <p>A source URI (which may also have a source prefix prepended) is the part of the URI that is given |
| to Cocoon for processing. For example, if you access a page with |
| <code>http://localhost:8080/cocoon/site/page.html</code>, then the source URI would be |
| <code>site/page.html</code>.</p> |
| </s2> |
| <s2 title="Destinations and Modifiable Sources"> |
| <p>Most of the time, when generating pages, the generated pages will be simply written to disk.</p> |
| <p>However, this is not the only option. Generated pages can be written anywhere for which a |
| <code>ModifiableSource</code> exists. So, for example, it is possible to generate a site and |
| have the pages written directly to a web server using FTP, by making use of the Avalon |
| <code>FTPSource</code>.</p> |
| </s2> |
| <s2 title="Target Types"> |
| <p>When generating a page, Cocoon needs to know how to decide upon the URI of the generated page. |
| This process could be described as 'URI arithmetic'.</p> |
| <p>Source and destination URIs are made up of the following elements:</p> |
| <ul> |
| <li>Source Prefix: Part of a source URI used to request a page but excluded from the destination |
| URI</li> |
| <li>Source URI: Part of a source URI that is used when calculating the destination URI</li> |
| <li>Destination URI: The base URI for a destination</li> |
| <li>Type: The method used for merging the above elements (can be append, replace or |
| insert)</li> |
| </ul> |
| <note>When combining elements to make a URI, it is the user's responsibility to include directory |
| separators. For example, <code>foo</code> with <code>bar</code> appended will be |
| <code>foobar</code>, whereas <code>foo/</code> with <code>bar</code> appended will be |
| <code>foo/bar</code>. |
| </note> |
| <s3 title="Appending"> |
| <p>Here, when calculating the destination URI, the source prefix is ignored, and the destination |
| URI is calculated by appending the source URI to the end of the destination URI. For example, |
| with the following values:</p> |
| <p>Source prefix: <code>site/</code>, source URI: <code>page.html</code>, destination URI: |
| <code>pages/</code></p> |
| <p>A request will be made to Cocoon for a page at: <code>site/page.html</code>. This will be |
| saved as <code>pages/page.html</code>.</p> |
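| <p>Assuming the <code>uri</code> element and attribute names used by the command line |
| interface's xconf file, the example above could be written as:</p> |
| <source> |
| <![CDATA[ |
| <!-- Requests site/page.html and saves the result as pages/page.html --> |
| <uri type="append" src-prefix="site/" src="page.html" dest="pages/"/> |
| ]]> |
| </source> |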
| </s3> |
| <s3 title="Replacing"> |
| <p>Here, when calculating the destination URI, the source prefix and the source URI are |
| ignored, and the destination URI is used as is. This is useful when you wish to save the |
| generated page with a filename that bears no relationship to the source URI. For example, |
| with the following values:</p> |
| <p>Source prefix: <code>site/</code>, source URI: <code>page.html</code>, destination URI: |
| <code>pages/simple.html</code></p> |
| <p>A request will be made to Cocoon for a page at: <code>site/page.html</code>. This will be |
| saved as <code>pages/simple.html</code>.</p> |
| <note>Given the nature of this target type, it inherently cannot be used when following links |
| (otherwise all pages will be written on top of each other).</note> |
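| <p>Assuming the <code>uri</code> element and attribute names used by the command line |
| interface's xconf file, the replacing example could be written as:</p> |
| <source> |
| <![CDATA[ |
| <!-- Requests site/page.html and saves the result as pages/simple.html --> |
| <uri type="replace" src-prefix="site/" src="page.html" dest="pages/simple.html"/> |
| ]]> |
| </source> |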
| </s3> |
| <s3 title="Inserting"> |
| <p>Here, when calculating the destination URI, the source prefix is ignored, and the source URI |
| is inserted into the destination URI at the point marked by an asterisk (*). This is intended |
| for use with complex protocols where the source URI does not appear at the end of the |
| destination URI.</p> |
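| <p>As an illustration with hypothetical values: given a source prefix of <code>site/</code>, |
| a source URI of <code>page.html</code> and a destination URI of <code>archive/*.saved</code>, |
| the page would be requested at <code>site/page.html</code> and, following the rule described |
| here, saved as <code>archive/page.html.saved</code>. Assuming the <code>uri</code> element and |
| attribute names used by the command line interface's xconf file, this might be written as:</p> |
| <source> |
| <![CDATA[ |
| <!-- The source URI replaces the asterisk in the destination URI (hypothetical values) --> |
| <uri type="insert" src-prefix="site/" src="page.html" dest="archive/*.saved"/> |
| ]]> |
| </source> |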
| </s3> |
| </s2> |
| <s2 title="Mime Type Checking"> |
| <p>Cocoon can optionally test the mime type for a page, and, if the mime type doesn't match the page's |
| extension, amend the destination URI to include the correct extension. This will ensure that pages |
| will load correctly when served by a static web server.</p> |
| <p>When Cocoon amends a destination URI, it also amends URIs for links in those pages, so that links |
| will still work when a site has been crawled.</p> |
| <note>This feature substantially slows down page generation, as each page must be generated three times |
| (once to find links, once to find its mime type and once to collect the actual content). This |
| can be avoided by ensuring that all URIs in the site are correct and do not need amending, in which |
| case it is only necessary to generate a page once.</note> |
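| <p>As a sketch, assuming the command line interface's xconf file controls this check through a |
| <code>confirm-extensions</code> attribute on its root element, mime type checking could be |
| switched off for a site whose URIs already carry the correct extensions:</p> |
| <source> |
| <![CDATA[ |
| <!-- Assumed attribute: false skips the mime type check --> |
| <cocoon confirm-extensions="false"> |
|   <!-- remaining configuration --> |
| </cocoon> |
| ]]> |
| </source> |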
| </s2> |
| </s1> |
| <s1 title="Following Links and Site Crawling"> |
| <p>Cocoon can be configured to either follow, or ignore, links in pages that it generates. It has two methods |
| of gathering links, 'link view' and 'link gathering'.</p> |
| <s2 title="Link View Crawling"> |
| <p>With link view crawling, Cocoon gets the links by generating the 'link view' for a page. Using link view |
| gives a significant degree of control over which links are gathered, as it is possible to |
| insert a transformer into the view to filter out links that should not be followed.</p> |
| <p>The disadvantage with link view crawling is that each page must be generated twice, which doubles page |
| generation time.</p> |
| <p>Link view is usually configured in the root sitemap with:</p> |
| <source> |
| <![CDATA[ |
| |
| <map:views> |
| |
| <map:view from-position="last" name="links"> |
| <map:serialize type="links"/> |
| </map:view> |
| |
| </map:views> |
| ]]> |
| </source> |
| <p>If you have this in your root sitemap, you do not need it in your sub-sitemaps. However, you may choose |
| to override it with one that carries out further processing, for example with an XSLT transformer that |
| removes links that should not be crawled.</p> |
| <p>See <link href="../concepts/views.html">views</link> for more on views. </p> |
| <p>You can see the link view yourself by appending <code>?cocoon-view=links</code> to the page's URI.</p> |
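| <p>A link view that filters links before they are gathered could be sketched as below. The stylesheet |
| path <code>stylesheets/filter-links.xsl</code> and the <code>private/</code> prefix are hypothetical, |
| chosen only for illustration. The view adds a transformer ahead of the <code>links</code> serializer:</p> |
| <source> |
| <![CDATA[ |
| <map:view from-position="last" name="links"> |
|   <!-- Hypothetical stylesheet that strips links which should not be crawled --> |
|   <map:transform src="stylesheets/filter-links.xsl"/> |
|   <map:serialize type="links"/> |
| </map:view> |
| ]]> |
| </source> |
| <p>The stylesheet itself can be an identity transform with one extra template that drops the |
| unwanted <code>href</code> attributes:</p> |
| <source> |
| <![CDATA[ |
| <xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform"> |
|   <!-- Copy everything through unchanged by default --> |
|   <xsl:template match="@*|node()"> |
|     <xsl:copy> |
|       <xsl:apply-templates select="@*|node()"/> |
|     </xsl:copy> |
|   </xsl:template> |
|   <!-- Drop href attributes that point into the (hypothetical) private/ area --> |
|   <xsl:template match="@href[starts-with(., 'private/')]"/> |
| </xsl:stylesheet> |
| ]]> |
| </source> |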
| </s2> |
| <s2 title="Link Gathering Crawling"> |
| <p>With link gathering crawling, links are gathered from the SAX stream right before the serializer. All |
| <code>src</code>, <code>href</code> and <code>xlink:href</code> attributes are taken to be links, and are |
| therefore followed.</p> |
| <p>The benefit of link gathering crawling is that pages do not need to be generated twice. However, one loses |
| the ability, available with link view crawling, to configure which links should be followed.</p> |
| </s2> |
| </s1> |
| <s1 title="Broken Links"> |
| <p>When a page cannot be found at a URI that has either been specified, or has been found as a link in another |
| page, it is considered 'broken'.</p> |
| <p>Exactly what is done when a broken link is found depends upon the method used to invoke |
| Cocoon. See related pages for specific details.</p> |
| <s2 title="Broken Link Handling using xconf Configuration method"> |
| <p>The xconf method allows for more sophisticated broken link handling. The |
| user can choose to have broken links reported to a file, in either plain |
| text or XML format.</p> |
| <p>When this file is plain text, it will have one link URI per line.</p> |
| <p>When this file is in XML, each entry will include the URI of the link and |
| a message explaining why the link is broken.</p> |
| <p>It is also possible to specify whether an error page should be generated |
| in the place of the broken page (based upon the configured |
| <code>&lt;map:handle-errors&gt;</code> code in the sitemap). If required, |
| an extension can be appended to the original file's URI to signify that |
| it is an error page (e.g. <code>.error</code>).</p> |
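| <p>A sketch of such a configuration, assuming the <code>broken-links</code> element and attribute |
| names used by the command line interface's xconf file. The report filename is a placeholder.</p> |
| <source> |
| <![CDATA[ |
| <!-- type: text or xml report; file: where the report is written (placeholder name); |
|      generate: whether to write an error page in place of the broken page; |
|      extension: appended to the URI of any generated error page --> |
| <broken-links type="xml" |
|               file="brokenlinks.xml" |
|               generate="true" |
|               extension=".error"/> |
| ]]> |
| </source> |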
| </s2> |
| </s1> |
| <s1 title="Precompiling XSPs"> |
| <p>When used offline, Cocoon can precompile XSP pages. If no URIs are specified, it will scan all directories |
| within the context directory looking for XSP files, each of which will be compiled. If URIs are specified, |
| all links will be followed looking for pages that make use of XSP, compiling those XSP pages as they are |
| found.</p> |
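| <p>As a sketch, assuming the command line interface's xconf file exposes this through a |
| <code>precompile-only</code> attribute on its root element, precompilation without page |
| generation might be requested with:</p> |
| <source> |
| <![CDATA[ |
| <!-- Assumed attribute: true compiles XSP pages instead of generating output --> |
| <cocoon precompile-only="true"> |
|   <!-- remaining configuration; with no URIs listed, the context directory is scanned --> |
| </cocoon> |
| ]]> |
| </source> |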
| </s1> |
| </body> |
| </document> |
| |