blob: e6d31e186e8ec0ebb712d68625438d64197f70ed [file] [log] [blame] [view]
[![Apache Sling](https://sling.apache.org/res/logos/sling.png)](https://sling.apache.org)
 [![Build Status](https://ci-builds.apache.org/job/Sling/job/modules/job/sling-org-apache-sling-sitemap/job/master/badge/icon)](https://ci-builds.apache.org/job/Sling/job/modules/job/sling-org-apache-sling-sitemap/job/master/) [![Test Status](https://img.shields.io/jenkins/tests.svg?jobUrl=https://ci-builds.apache.org/job/Sling/job/modules/job/sling-org-apache-sling-sitemap/job/master/)](https://ci-builds.apache.org/job/Sling/job/modules/job/sling-org-apache-sling-sitemap/job/master/test/?width=800&height=600) [![Coverage](https://sonarcloud.io/api/project_badges/measure?project=apache_sling-org-apache-sling-sitemap&metric=coverage)](https://sonarcloud.io/dashboard?id=apache_sling-org-apache-sling-sitemap) [![Sonarcloud Status](https://sonarcloud.io/api/project_badges/measure?project=apache_sling-org-apache-sling-sitemap&metric=alert_status)](https://sonarcloud.io/dashboard?id=apache_sling-org-apache-sling-sitemap) [![JavaDoc](https://www.javadoc.io/badge/org.apache.sling/org.apache.sling.sitemap.svg)](https://www.javadoc.io/doc/org.apache.sling/org.apache.sling.sitemap) [![Maven Central](https://maven-badges.herokuapp.com/maven-central/org.apache.sling/org.apache.sling.sitemap/badge.svg)](https://search.maven.org/#search%7Cga%7C1%7Cg%3A%22org.apache.sling%22%20a%3A%22org.apache.sling.sitemap%22) [![License](https://img.shields.io/badge/License-Apache%202.0-blue.svg)](https://www.apache.org/licenses/LICENSE-2.0)
# Apache Sling Sitemap
The Apache Sling Sitemap module is an extension for Apache Sling that helps to serve XML Sitemaps to Search Engines. It
was designed to cover various uses cases from small sites serving sitemaps on-demand, large sites generating them in the
background to even sites that collect 3rd party data to include dynamically rendered pages.
## Highlights
* A simple, builder-like API to create Sitemaps, that hides all the XML specifics from the implementation
* Support for on-demand and background generation w/ continuation after job interruption
* Support for nested sitemaps, that are automatically collected into a sitemap indexes
* Support for auto-balancing sitemaps into multiple files for background generation
## Getting Started
The Sling Sitemap module is rather abstract, as it highly depends on the content structure of the application it is
used in. To get started a few things must be done:
1) Implement at least one `SitemapGenerator`. The abstract `ResourceTreeSitemapGenerator` may be a good starting
point for any generator walking the resource tree.
2) Configure the `SitemapServlet` to register the resource type(s) which may be a sitemap root resource.
3) Configure a service user mapping for `org.apache.sling.sitemap:sitemap-reader` granting read access to the content.
4) Configure another service user mapping one for `org.apache.sling.sitemap:sitemap-writer` granting read access to the
content and write access to the storage path for background generation (per default /var/sitemaps).
5) Finally, either configure a `SitemapScheduler` to create a job for background generation, or implement the
`SitemapGenerator` to serve the sitemaps on-demand.
## Implementation Details
### Content Model
In order to serve a sitemap a resource must be marked as sitemap root resource. This is done by adding
a `sling:sitemapRoot = true` property either to the resource or its `jcr:content` child.
When multiple resources in a resource tree are marked as sitemap roots, the one closest to the repository root is
considered to top level sitemap root and serves a sitemap-index additionally to the sitemap.
```
/content/site/ch/
+ de-ch/
- sitemapRoot = true
+ faqs/
+ news/
+ products/
- sitemapRoot = true
+ fr-ch/
- sitemapRoot = true
...
```
In the example above the paths `de-ch/` and `fr-ch/` are both top level sitemap roots and `de-ch/products/` is a nested
sitemap root. The sitemaps will be served as following, assuming the appropriate mappings being in place:
```
https://site.ch/de-ch.sitemap-index.xml
https://site.ch/de-ch.sitemap.xml
https://site.ch/de-ch.sitemap.products-sitemap.xml
https://site.ch/fr-ch.sitemap-index.xml
https://site.ch/fr-ch.sitemap.xml
```
### Sitemap Generation
The module does not ship a specific `SitemapGenerator` implementation. Products/Projects using the Apache Sling Sitemap
module must implement an appropriate `SitemapGenerator` that fits their content model. An abstract
`ResourcceTreeSitemapGenator` implementation is available to cover the most common use cases.
Each `SitemapGenerator` may produce multiple sitemaps for a given sitemap root. For example, a default sitemap and a news
specific sitemap that contains only up to 1000 urls that were changed in the past 2 days. Or as another example,
consider an eCommerce site, that generates a product sitemap for each top level category of a catalog. To enable that,
a `SitemapGenerator` can return _0..n_ names for a given resource, each name representing a single sitemap at the given
resource.
`SitemapGenerator` implementations need to be registered as OSGI services. In case of an overlap, it depends on
the `service.ranking` which `SitemapGenerators` are used for a particular sitemap root. For example, it may be that two
`SitemapGenerators` can generate a default sitemap for a sitemap root, but the second one can also generate a news
sitemap. The first, higher-ranked `SitemapGenerator` will be used for the default sitemap, but the second one will still
be taken into account for the news sitemap.
#### Background Generation
A configurable scheduler can be configured to trigger background generation. Multiple schedules can be created for
different sitemap names or generators as described before. This enables the background generation to align on the
cadence in which particular content may be updated, for example product catalogues that sync on a regular basis.
For each sitemap root in the repository and for each sitemap name returned for them, a job with the topic
`org/apache/sling/sitemap/build` will be queued. The `SitemapGeneratorExecutor` implementation consumes those jobs and
calls the corresponding `SitemapGeneator`. It is recommended to create an unordered queue for those jobs so that they
can be distributed across multiple instances within a cluster.
The `SitemapGeneratorExecutor` provides an execution context to the `SitemapGenerator` which it may use to keep track of
progress. The implementation on the other hand will persist this state along with the already written sitemap after
a configurable amount of urls has been added. This allows to resume jobs after an instance gets restarted or discarded
in a dynamic cluster. Per default the `SitemapGeneratorExecutor` is configured with a chunk size of `Integer.MAX_VALUE`,
which effectively means that no checkpoints will be written. When using this feature make sure to find a good balance
between write overhead and performance gain for those particular cases.
Background generation supports auto-balancing according to configurable limits for size (in bytes), and the number of
urls in a single sitemap file. This is transparently handled by the `SitemapGeneratorExecutor`, providing a `Sitemap`
instance which pipes added urls to multiple files when needed. As a consequence returning sitemap files from storage for
a given name and sitemap root may result in multiple return values.
#### On-demand Generation
For smaller sites, calculating sitemaps in the background may not be necessary and serving sitemaps when they get
requested may even result in higher accuracy. On the other hand serving a sitemap on-demand within the timeout of
different crawlers highly depends on the amount of content and the `SitemapGenerator` implementation(s) used. Because of
that, serving sitemaps on-demand must be explicitly enabled.
To enable serving sitemaps on-demand, a `SitemapGenerator` must indicate that a particular sitemap name should be served
on demand. Alternatively the `SitemapGeneratorManagerImpl` can be configured to force all sitemaps to be served
on-demand. In both cases the `SitemapServlet` changes its behaviour slightly:
- When serving a sitemap-index, it queries for all sitemap roots and adds the sitemaps of those, that should be served
on-demand. Additionally, all sitemaps form the top level sitemap root's storage location are added if not already
present.
- For serving sitemap files it first checks if the sitemap should be served on-demand, if not it falls back to the
sitemap file from the top level sitemap root's storage location.
**On-demand generation does NOT support auto-balancing. The configured limits will be ignored.**
### Sitemap Extensions
The builder-like API supports adding sitemap extensions (image, video, news, alternate languages ...) by implementing an
`ExtensionProvider`. This provider has to be registered specifying `namespace`, `prefix`, and `localName` of the xml
element the extension adds to an url object.
In order to hide the implementation detail from the consumer API, the `ExtensionProvider` works with an abstract
`AbstractExtension` class, and the consumer API only with an `Extension` marker interface. To implement an Extension:
* Extend the marker interface with methods to capture the extension specific data
* Implement this extension interface by extending the `AbstractExtension` class
* Provide an instance of the extension implementation by implementing an `ExtensionProvider`
* Make sure to register the `ExtensionProvider` with the `extension.interface` set to the fqn of the extension interface
An example extension implementation can be found with
the [AlternateLanguageExtension](src/main/java/org/apache/sling/sitemap/builder/extensions/AlternateLanguageExtension.java).
For the following implemented extensions, refer to the respective interfaces in [o.a.s.sitemap.builder.extensions](src/main/java/org/apache/sling/sitemap/builder/extensions):
* [Alternate Language Links](https://developers.google.com/search/docs/advanced/crawling/localized-versions#sitemap)
* [Google Image Sitemap](https://developers.google.com/search/docs/advanced/sitemaps/image-sitemaps)
* [Google Video Sitemap](https://developers.google.com/search/docs/advanced/sitemaps/video-sitemaps)
* [Google News Sitemap](https://developers.google.com/search/docs/advanced/sitemaps/news-sitemap)