Caching ClassLoader Factory

CachingClassLoaderFactory implements Accumulo's ContextClassLoaderFactory SPI. This implementation is designed around a set of remote classpath resource files listed in a JSON-formatted manifest for each context, and a local cache directory that is intended to be shared across multiple processes.

The URL to the manifest, as a String, is used for the context parameter. In return, this factory downloads the resources from the remote locations specified in the manifest to a local storage cache, where they will be used to produce a corresponding ClassLoader instance. The specified URL will then be monitored for any changes to the manifest at that location, at the monitoring interval specified in the manifest.

Resources must be jar files (jar, war, ear, etc.). Other resource file types (like text or data files) must be contained within a jar file and can be read from the classloader using .getResourceAsStream().
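As an illustration of reading a non-class resource that is packaged inside a jar, here is a minimal sketch. It does not use this factory: a plain java.net.URLClassLoader stands in for the ClassLoader the factory would return, and the class name JarResourceSketch, the entry name, and the temporary jar are all hypothetical.

```java
import java.io.IOException;
import java.io.InputStream;
import java.net.URL;
import java.net.URLClassLoader;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.jar.JarEntry;
import java.util.jar.JarOutputStream;

final class JarResourceSketch {

  // Writes a jar containing a single entry with the given text content.
  static void writeJar(Path jar, String entryName, String content) throws IOException {
    try (JarOutputStream out = new JarOutputStream(Files.newOutputStream(jar))) {
      out.putNextEntry(new JarEntry(entryName));
      out.write(content.getBytes(StandardCharsets.UTF_8));
      out.closeEntry();
    }
  }

  // Reads a resource by name through a class loader; fails if it is absent.
  static String readResource(ClassLoader loader, String name) throws IOException {
    try (InputStream in = loader.getResourceAsStream(name)) {
      if (in == null) {
        throw new IOException("resource not found: " + name);
      }
      return new String(in.readAllBytes(), StandardCharsets.UTF_8);
    }
  }

  // Builds a jar in a temporary directory and reads the entry back.
  static String roundTrip(String entryName, String content) throws IOException {
    Path dir = Files.createTempDirectory("jar-resource-sketch");
    Path jar = dir.resolve("data.jar");
    try {
      writeJar(jar, entryName, content);
      // Assumption: URLClassLoader is only a stand-in for the factory's loader.
      try (URLClassLoader loader = new URLClassLoader(new URL[] {jar.toUri().toURL()}, null)) {
        return readResource(loader, entryName);
      }
    } finally {
      Files.deleteIfExists(jar);
      Files.deleteIfExists(dir);
    }
  }

  public static void main(String[] args) throws IOException {
    System.out.println(roundTrip("data/settings.txt", "hello"));
  }
}
```

With a ClassLoader obtained from this factory, only the readResource step would apply; the jar would already be in the local storage cache.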

Introduction

This factory creates ClassLoader instances that point to locally cached copies of remote resource files. In this way, this factory allows placing common resources in a remote location for use across many hosts.

This factory uses a storage cache in the local file system for any files it downloads from a remote URL.

To use this factory, one must store resource files in a location that can be specified by a supported URL, and then create a JSON-formatted manifest file that contains an optional comment, a monitoring interval (in seconds, greater than 0), and a list of resource URLs along with a checksum algorithm and checksum for each resource file. This manifest file must then be stored where this factory can download it, and the URL to that file must be used as the context parameter for this factory's getClassLoader(String context) method.

This factory can handle manifest and resource URLs of any type that are supported by your application via a registered URLStreamHandlerProvider, such as the built-in file: and http: types. A provider that handles hdfs: URL types must be provided by the user. This may be provided by the Apache Hadoop project, or by another library. A reference implementation is available elsewhere in this project.

Here is an example manifest:

{
  "comment": "an optional comment",
  "monitorIntervalSeconds": 5,
  "resources": [
    {
      "location": "file:/home/user/ClassLoaderTestA/TestA.jar",
      "algorithm": "MD5",
      "checksum": "ae5e8248a9243751d60dbcaaedeb93ba"
    },
    {
      "location": "hdfs://localhost:8020/contextB/TestB.jar",
      "algorithm": "SHA-256",
      "checksum": "ed95fe130090fd64c2caddc3e37555ada8ba49a91bfd8ec1fd5c989d340ad0e0"
    },
    {
      "location": "http://localhost:80/TestC.jar",
      "algorithm": "SHA3-224",
      "checksum": "958f12ddc5acf87c2fe0ceed645327bb0c92e268acf915c4a374c14b"
    },
    {
      "location": "https://localhost:80/TestD.jar",
      "algorithm": "SHA-512/224",
      "checksum": "f7f982521ceb8ca97662973ada9b92b86de6bbaf233f14fd47efd792"
    }
  ]
}
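For readers who want to produce or double-check a "checksum" value by hand, the following is a minimal sketch using java.security.MessageDigest. It assumes the checksum is the lowercase hexadecimal encoding of the digest of the file's bytes, which is consistent with the example values above but is not confirmed here; the class name ChecksumSketch and its methods are hypothetical.

```java
import java.io.IOException;
import java.io.InputStream;
import java.net.URL;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

final class ChecksumSketch {

  // Hex-encodes bytes with lowercase digits (assumed manifest encoding).
  static String toHex(byte[] bytes) {
    StringBuilder sb = new StringBuilder(bytes.length * 2);
    for (byte b : bytes) {
      sb.append(String.format("%02x", b));
    }
    return sb.toString();
  }

  // Digests everything readable from the stream with the named algorithm,
  // for example "MD5", "SHA-256", "SHA3-224", or "SHA-512/224".
  static String checksum(String algorithm, InputStream in)
      throws IOException, NoSuchAlgorithmException {
    MessageDigest digest = MessageDigest.getInstance(algorithm);
    byte[] buffer = new byte[8192];
    int n;
    while ((n = in.read(buffer)) != -1) {
      digest.update(buffer, 0, n);
    }
    return toHex(digest.digest());
  }

  // Convenience overload: digests the content found at a URL.
  static String checksum(String algorithm, URL location)
      throws IOException, NoSuchAlgorithmException {
    try (InputStream in = location.openStream()) {
      return checksum(algorithm, in);
    }
  }

  public static void main(String[] args) throws Exception {
    // Usage: java ChecksumSketch.java SHA-256 file:/path/to/some.jar
    System.out.println(checksum(args[0], new URL(args[1])));
  }
}
```

A resource whose computed value differs from the manifest entry would fail verification, so the checksum should be recomputed whenever a jar is rebuilt.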

Quick Start

The following are the steps to use this context classloader with jars accessible via URL.

1. Add the jar for this project to the Accumulo classpath.
2. Optionally add the jar for the hdfs-urlstreamhandler-provider project to the Accumulo classpath. This will enable loading jars from HDFS URLs.
3. Set the following Accumulo properties:
   - general.context.class.loader.factory=org.apache.accumulo.classloader.ccl.CachingClassLoaderFactory
   - general.custom.classloader.ccl.cache.dir=file:/absolute/path/to/some/local/directory
   - general.custom.classloader.ccl.allowed.urls.pattern=someRegexPatternForAllowedUrls
   - (optional) general.custom.classloader.ccl.update.grace.minutes=30
4. Restart Accumulo (required for the first two properties; the others can be changed on a running system)
5. Create a manifest like the example above, either manually, or by using the provided tool (see “Creating a Manifest” section below) and make it accessible with a URL.
6. Set the following table property to link to a manifest's URL. For example:
   - table.class.loader.context=(file|hdfs|http|https)://path/to/context/manifest.json
7. Repeat steps 5 and 6 for different contexts and/or tables.

How it Works

When this factory receives a request for a ClassLoader for a given URL, it downloads a copy of the manifest and parses it. If it has recently acquired that manifest, based on the monitoring interval from a previous retrieval, it can skip this step and use the manifest from the earlier retrieval, which is kept up-to-date by the background monitoring thread that started when it was previously retrieved. Once it has the manifest, it returns a ClassLoader instance containing the resources defined in that manifest, first downloading any missing resources and verifying them using the checksums in the manifest.

ClassLoader instances are stored in a de-duplicating cache in memory with a minimum lifetime of 24 hours. So, no two instances will ever exist in a process for the same manifest.

If this manifest had not previously been downloaded, a background monitoring task is set up to ensure the URL is watched for any changes to the manifest. This monitoring continues for as long as there exist ClassLoader instances in the system that were constructed from the manifest at that URL (at least 24 hours, since that is the minimum time they will exist in the de-duplicating cache).

Local Storage Cache

The local storage cache location is configured by the user by setting the required Accumulo property named general.custom.classloader.ccl.cache.dir to a directory on the local file system. This location may be specified as an absolute path or as a URL representing an absolute path with the file scheme. The location and its directory structure will be created on first use if they do not already exist. This will cause an error if the application does not have permission to create the directories.

The selected location should be a persistent location with plenty of space to store downloaded resources, and should be writable by all the processes which use this factory to share the same resources. You may wish to pre-create the base directory specified by the property, and the three sub-directories, manifests, resources, and working, to set the appropriate permissions and ACLs.

Resources downloaded to this cache may be used by multiple manifests, threads, and processes, so be very careful when removing old contents to ensure that they are no longer needed. If a resource file is deleted from the local storage cache while a ClassLoader exists that references it, that ClassLoader may, and probably will, stop working correctly. Similarly, files that have been downloaded should not be modified, because any modification will likely cause unexpected behavior in classloaders still using the file.

Do NOT use a temporary directory for the local storage cache location.
The local storage cache location MUST use a file system that supports atomic moves and hard links.
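Before settling on a cache location, it may help to probe whether the candidate directory supports these operations. The following is a minimal sketch using java.nio.file; the class name CacheDirCheckSketch and its method are hypothetical, it is not part of this factory, and it only checks that one hard link and one atomic move succeed within the given directory.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardCopyOption;

final class CacheDirCheckSketch {

  // Returns true if a hard link and an atomic move both succeed inside dir.
  // Failing to create the probe file itself (e.g. no write permission)
  // propagates as an IOException instead of returning false.
  static boolean supportsHardLinksAndAtomicMoves(Path dir) throws IOException {
    Path source = Files.createTempFile(dir, "probe-", ".tmp");
    Path link = dir.resolve(source.getFileName() + ".link");
    Path moved = dir.resolve(source.getFileName() + ".moved");
    try {
      // Throws UnsupportedOperationException where hard links are unsupported.
      Files.createLink(link, source);
      // Throws AtomicMoveNotSupportedException (an IOException) if not atomic.
      Files.move(source, moved, StandardCopyOption.ATOMIC_MOVE);
      return true;
    } catch (UnsupportedOperationException | IOException e) {
      return false;
    } finally {
      // Remove every probe file, whichever step was reached.
      Files.deleteIfExists(link);
      Files.deleteIfExists(moved);
      Files.deleteIfExists(source);
    }
  }

  public static void main(String[] args) throws IOException {
    // Usage: java CacheDirCheckSketch.java /absolute/path/to/some/local/directory
    System.out.println(supportsHardLinksAndAtomicMoves(Paths.get(args[0])));
  }
}
```

A result of true only shows that the operations worked once for the current user; it does not replace checking permissions for every process that will share the cache.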

Security

The Accumulo property general.custom.classloader.ccl.allowed.urls.pattern is another required parameter. It is used to limit the URLs that can be fetched when downloading manifests or resources. Since the process using this factory will be using its own permissions to fetch resources, and placing a copy of those resources in a local directory where others may access them, that presents a potential file disclosure security risk. This property allows a system administrator to mitigate that risk by restricting access to only approved URLs (e.g. to exclude non-approved locations like file:/path/to/accumulo.properties or hdfs://host/path/to/accumulo/rfile.rf).

An example value for this property might look like: https://example.com/path/to/manifests/.* or (file:/etc|hdfs://example[.]com:9000)/path/to/manifests/.*
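A candidate pattern can be tried against sample URLs before it is deployed. The sketch below uses java.util.regex and assumes that the entire URL string must match the pattern (Matcher.matches() semantics); how this factory actually applies the pattern is not stated here, so treat that as an assumption. The class name AllowedUrlSketch and its method are hypothetical.

```java
import java.util.regex.Pattern;

final class AllowedUrlSketch {

  // Assumption: a URL is allowed only if the whole string matches the pattern.
  static boolean isAllowed(String allowedUrlsPattern, String url) {
    return Pattern.compile(allowedUrlsPattern).matcher(url).matches();
  }

  public static void main(String[] args) {
    String pattern = "(file:/etc|hdfs://example[.]com:9000)/path/to/manifests/.*";
    // Falls under an approved prefix: prints true
    System.out.println(isAllowed(pattern, "hdfs://example.com:9000/path/to/manifests/a.json"));
    // A non-approved location from the text above: prints false
    System.out.println(isAllowed(pattern, "file:/path/to/accumulo.properties"));
  }
}
```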

Note: this property affects all URLs fetched by this factory, including manifest URLs and any resource URLs defined inside any fetched manifest. It should be updated by a system maintainer if any new manifests need to use new locations. It may be updated on a running system, and will take effect after approximately a minute.

Creating a Manifest

Users may take advantage of the Manifest.create(String,int,String,URL[]) method to construct a Manifest object, programmatically. This will calculate the checksums of the classpath elements. Manifest.toJson() can be used to serialize the Manifest to a String to store in a file.

Alternatively, if this library's jar is built and placed onto Accumulo's CLASSPATH, then one can run bin/accumulo create-classloader-manifest to create the manifest from the command line. The resulting JSON is printed to stdout and can be redirected to a file. The command takes the following arguments:

- the monitor interval, in seconds (e.g. -i 300),
- an optional checksum algorithm to use (e.g. -a 'SHA3-512'), and
- a list of file URLs (e.g. hdfs://host:port/path/to/one.jar file:/host/path/to/two.jar)

Updating a Manifest

This factory uses a background thread to fetch the manifest at its initial URL using the interval specified in the manifest the last time it was retrieved. The manifest at a watched URL can be replaced at any time with any changes, including changes to the monitor interval and resources. When the manifest is next retrieved, the new manifest will be used as though it were an entirely new manifest at that URL. The next retrieval will occur after the monitor interval read from the most recent retrieval elapses. Any change to the context's resources will result in the new resources being downloaded and verified, and a new ClassLoader instance being created and ready to be returned the next time getClassLoader(String context) is called with that URL.

Note: if the contents of a manifest change in only inconsequential ways, such as JSON formatting changes, then those changes will not trigger any new downloads or ClassLoader staging. This is because the manifest JSON is normalized prior to computing its checksum to determine if any changes have occurred.

Error Handling

If there is an exception in creating the initial ClassLoader, such as being unable to retrieve the manifest at the specified URL, or if a resource does not match its checksum in the manifest, then a ContextClassLoaderException is thrown. If there is an exception when updating the classloader, then the exception is logged and the classloader is not updated. Calls to getClassLoader(String context) will return the most recent classloader with valid contents.

The property general.custom.classloader.ccl.update.grace.minutes determines how long the update process continues to return the most recent valid classloader when an exception occurs in the background update thread. A zero value (default) will cause the most recent valid classloader to be returned indefinitely. Otherwise, if a non-zero number is configured, then monitoring will stop after the update has failed for that number of minutes. Once monitoring has stopped, any subsequent call to getClassLoader(String context) will behave as it would during an initial request, throwing a ContextClassLoaderException if the manifest cannot be retrieved or a ClassLoader cannot be constructed at that time.
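The grace period rule can be restated as a small decision function. This is only a sketch of the rule as described above, not the factory's implementation: the class name GracePeriodSketch, the method, and the choice to measure from the first consecutive failure and to stop exactly when the elapsed time reaches the grace period are all assumptions.

```java
import java.time.Duration;
import java.time.Instant;

final class GracePeriodSketch {

  // Returns true once updates have been failing for at least the grace period.
  // A zero grace period means monitoring is never stopped because of failures.
  static boolean shouldStopMonitoring(Instant firstFailure, Instant now, Duration grace) {
    if (grace.isZero()) {
      return false;
    }
    // Assumption: the boundary is inclusive (elapsed >= grace).
    return Duration.between(firstFailure, now).compareTo(grace) >= 0;
  }

  public static void main(String[] args) {
    Instant firstFailure = Instant.parse("2024-01-01T00:00:00Z");
    Duration grace = Duration.ofMinutes(30);
    // 10 minutes of failures with a 30 minute grace period: prints false
    System.out.println(
        shouldStopMonitoring(firstFailure, firstFailure.plus(Duration.ofMinutes(10)), grace));
    // 31 minutes of failures with a 30 minute grace period: prints true
    System.out.println(
        shouldStopMonitoring(firstFailure, firstFailure.plus(Duration.ofMinutes(31)), grace));
  }
}
```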

Cleanup

Because the cache directory is shared among multiple processes, and one process can't know what the other processes are doing, this class cannot always clean up the shared cache directory of unused resources automatically. It is left to the user to remove unused files from the cache. The local storage is organized into several directories, which are explained here to aid in understanding when unused files can be safely removed.

Manifests

The manifests directory contents are always safe to delete, but doing so may impair troubleshooting. This directory contains copies of any downloaded manifests, but the behavior of the factory does not depend on the existence of these files.

Resources

The resources directory contains a shared pool of remote resource files that have been fetched for all manifests. The files in this directory are safe to delete at any time. However, some considerations should be made:

- Deleting resources that are still needed will cause them to be downloaded again the next time they are needed, which may cause an increase in network activity.
- If any of the removed files had hard-linked “copies” in the working directory, the newly downloaded copy will increase the total amount of storage (whereas the original would have shared storage space with the hard-linked “copies”).

Working

The working directory contains temporary files for files currently being downloaded, and temporary directories containing hard-linked “copies” of files from the resources directory. These files and directories contain the process ID (PID) for the process that created them. Normally, these files are automatically cleaned up, but if a process is killed before that can happen, they may be left behind. The files with the PID in them can safely be removed, so long as the process that created them has been terminated.
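When deciding whether leftover files are safe to remove, one way to check that a given PID no longer belongs to a running process on the local host is java.lang.ProcessHandle (Java 9 and later). This is a minimal sketch with a hypothetical class name, PidCheckSketch; how the PID is embedded in the file and directory names is not described here, so extracting it is left to the reader.

```java
final class PidCheckSketch {

  // Returns true if a process with this PID is currently running on this host.
  // Note: operating systems can reuse PIDs, so a live PID does not prove that
  // it is the same process that created the leftover files.
  static boolean isProcessAlive(long pid) {
    return ProcessHandle.of(pid).map(ProcessHandle::isAlive).orElse(false);
  }

  public static void main(String[] args) {
    // Usage: java PidCheckSketch.java 12345
    long pid = args.length > 0 ? Long.parseLong(args[0]) : ProcessHandle.current().pid();
    System.out.println(pid + " alive: " + isProcessAlive(pid));
  }
}
```

A result of false suggests the files created by that PID are stale; a result of true calls for a closer look before deleting anything.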

This directory also contains files whose names do not contain a PID. These files end with the .downloading suffix and exist to signal across processes that a resource file is currently being downloaded by a process. These files are very small, containing only the PID of the most recent process to attempt downloading the file. They are removed when a download completes, or the next time the corresponding resource file is used, if it had already been downloaded successfully by a process that later failed. Removing them won't break the application in any way, but doing so may result in a redundant download, which can result in increased network activity or storage space (see the previous section for considerations regarding the resources directory).