| <!-- |
| |
| Licensed to the Apache Software Foundation (ASF) under one |
| or more contributor license agreements. See the NOTICE file |
| distributed with this work for additional information |
| regarding copyright ownership. The ASF licenses this file |
| to you under the Apache License, Version 2.0 (the |
| "License"); you may not use this file except in compliance |
| with the License. You may obtain a copy of the License at |
| |
| https://www.apache.org/licenses/LICENSE-2.0 |
| |
| Unless required by applicable law or agreed to in writing, |
| software distributed under the License is distributed on an |
| "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY |
| KIND, either express or implied. See the License for the |
| specific language governing permissions and limitations |
| under the License. |
| |
| --> |
| # Caching ClassLoader Factory |
| |
| `CachingClassLoaderFactory` implements Accumulo's `ContextClassLoaderFactory` |
| SPI. This implementation is designed around a set of remote classpath resource |
| files listed in a JSON-formatted manifest for each context, and a local cache |
| directory that is intended to be shared across multiple processes. |
| |
| The URL to the manifest, as a String, is used for the context parameter. In |
| return, this factory downloads the resources from the remote locations specified |
| in the manifest to a local storage cache, where they will be used to produce a |
| corresponding `ClassLoader` instance. The specified URL will then be monitored |
| for any changes to the manifest at that location, at the monitoring interval |
| specified in the manifest. |
| |
| Resources must be jar files (jar, war, ear, etc.). Other resource file types |
| (like text or data files) must be contained within a jar file and can be read |
| from the classloader using `.getResourceAsStream()`. |
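
As an illustration of reading a non-class resource that has been packaged in a jar, the following sketch builds a throwaway jar, opens it with a plain `URLClassLoader`, and reads an entry with `getResourceAsStream()`. The class name, the entry name `config/settings.txt`, and its content are hypothetical, and a `URLClassLoader` stands in here for whatever `ClassLoader` this factory returns.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.net.URL;
import java.net.URLClassLoader;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.jar.JarEntry;
import java.util.jar.JarOutputStream;

final class JarResourceExample {

  // Builds a small jar with one text entry; it stands in for a resource jar
  // that would normally be listed in a manifest.
  static Path createJar(Path dir, String entryName, String content) throws IOException {
    Path jar = dir.resolve("data.jar");
    try (JarOutputStream out = new JarOutputStream(Files.newOutputStream(jar))) {
      out.putNextEntry(new JarEntry(entryName));
      out.write(content.getBytes(StandardCharsets.UTF_8));
      out.closeEntry();
    }
    return jar;
  }

  // Reads a named entry through the ClassLoader instead of the file system.
  static String readResource(ClassLoader loader, String name) throws IOException {
    try (InputStream in = loader.getResourceAsStream(name)) {
      if (in == null) {
        throw new IOException("resource not found: " + name);
      }
      return new String(in.readAllBytes(), StandardCharsets.UTF_8);
    }
  }

  // End-to-end demonstration using a temporary directory.
  static String demo() {
    try {
      Path dir = Files.createTempDirectory("jar-resource-example");
      Path jar = createJar(dir, "config/settings.txt", "threshold=10");
      try (URLClassLoader loader = new URLClassLoader(new URL[] {jar.toUri().toURL()}, null)) {
        return readResource(loader, "config/settings.txt");
      } finally {
        // The loader is closed before this runs, so the jar can be removed.
        Files.deleteIfExists(jar);
        Files.deleteIfExists(dir);
      }
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }
}
```

Resource names passed to `getResourceAsStream()` are jar-relative paths separated by `/`, with no leading slash when called on a `ClassLoader`.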
| |
| ## Introduction |
| |
| This factory creates `ClassLoader` instances that point to locally cached |
| copies of remote resource files. In this way, this factory allows placing |
| common resources in a remote location for use across many hosts. |
| |
| This factory uses a storage cache in the local file system for any files it |
| downloads from a remote URL. |
| |
| To use this factory, one must store resource files in a location that can be |
| specified by a supported URL, and then must create a JSON-formatted manifest |
| file that contains an optional comment, a monitoring interval (in seconds, |
| greater than 0), and a list of resource URLs along with a checksum for each |
| resource file. This manifest file must then be stored somewhere this factory |
| can download it, and the URL to that file must be used as the `context` |
| parameter for this factory's `getClassLoader(String context)` method. |
| |
| This factory can handle manifest and resource URLs of any type that are |
| supported by your application via a registered [URLStreamHandlerProvider][1], |
| such as the built-in `file:` and `http:` types. A provider that handles `hdfs:` |
| URL types must be provided by the user. This may be provided by the Apache |
| Hadoop project, or by another library. A reference implementation is available |
| [elsewhere in this project][2]. |
| |
| Here is an example manifest: |
| |
| ```json |
| { |
| "comment": "an optional comment", |
| "monitorIntervalSeconds": 5, |
| "resources": [ |
| { |
| "location": "file:/home/user/ClassLoaderTestA/TestA.jar", |
| "algorithm": "MD5", |
| "checksum": "ae5e8248a9243751d60dbcaaedeb93ba" |
| }, |
| { |
| "location": "hdfs://localhost:8020/contextB/TestB.jar", |
| "algorithm": "SHA-256", |
| "checksum": "ed95fe130090fd64c2caddc3e37555ada8ba49a91bfd8ec1fd5c989d340ad0e0" |
| }, |
| { |
| "location": "http://localhost:80/TestC.jar", |
| "algorithm": "SHA3-224", |
| "checksum": "958f12ddc5acf87c2fe0ceed645327bb0c92e268acf915c4a374c14b" |
| }, |
| { |
| "location": "https://localhost:80/TestD.jar", |
| "algorithm": "SHA-512/224", |
| "checksum": "f7f982521ceb8ca97662973ada9b92b86de6bbaf233f14fd47efd792" |
| } |
| ] |
| } |
| ``` |
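
To fill in the `checksum` field by hand, one option is the JDK's `MessageDigest`. The sketch below assumes the checksum is the lower-case hexadecimal encoding of the digest over the raw file bytes, as the example values above suggest; the class and method names are hypothetical and not part of this project.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

final class ChecksumExample {

  // Hex-encodes a digest in lower case, matching the style of the example manifest.
  static String toHex(byte[] bytes) {
    StringBuilder sb = new StringBuilder(bytes.length * 2);
    for (byte b : bytes) {
      sb.append(Character.forDigit((b >> 4) & 0xF, 16));
      sb.append(Character.forDigit(b & 0xF, 16));
    }
    return sb.toString();
  }

  // Digests an in-memory byte array with the named algorithm.
  static String checksum(String algorithm, byte[] data) {
    try {
      return toHex(MessageDigest.getInstance(algorithm).digest(data));
    } catch (NoSuchAlgorithmException e) {
      throw new IllegalArgumentException("unsupported algorithm: " + algorithm, e);
    }
  }

  // Streams a file through the digest so large jars need not fit in memory.
  static String checksum(String algorithm, Path file) {
    try (InputStream in = Files.newInputStream(file)) {
      MessageDigest digest = MessageDigest.getInstance(algorithm);
      byte[] buffer = new byte[8192];
      int n;
      while ((n = in.read(buffer)) != -1) {
        digest.update(buffer, 0, n);
      }
      return toHex(digest.digest());
    } catch (NoSuchAlgorithmException e) {
      throw new IllegalArgumentException("unsupported algorithm: " + algorithm, e);
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }
}
```

The algorithm names used in the example manifest (`MD5`, `SHA-256`, `SHA3-224`, `SHA-512/224`) are standard `MessageDigest` names on Java 9 and later.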
| |
| ## Quick Start |
| |
| The following are the steps to use this context classloader with jars |
| accessible via URL. |
| |
| 1. Add the jar for this project to the Accumulo classpath. |
| 2. Optionally add the jar for the [hdfs-urlstreamhandler-provider][2] project |
| to the Accumulo classpath. This will enable loading jars from HDFS URLs. |
| 3. Set the following Accumulo properties: |
| * `general.context.class.loader.factory=org.apache.accumulo.classloader.ccl.CachingClassLoaderFactory` |
| * `general.custom.classloader.ccl.cache.dir=file:/absolute/path/to/some/local/directory` |
| * `general.custom.classloader.ccl.allowed.urls.pattern=someRegexPatternForAllowedUrls` |
| * (optional) `general.custom.classloader.ccl.update.grace.minutes=30` |
| 4. Restart Accumulo (required for the first two properties; the others can be |
| changed on a running system). |
| 5. Create a manifest like the example above, either manually, or by using the |
| provided tool (see "Creating a Manifest" section below) and make it |
| accessible with a URL. |
| 6. Set the following table property to link to a manifest's URL. For example: |
| * `table.class.loader.context=(file|hdfs|http|https)://path/to/context/manifest.json` |
| 7. Repeat steps 5 and 6 for different contexts and/or tables. |
| |
| ## How it Works |
| |
| When this factory receives a request for a `ClassLoader` for a given URL, it |
| downloads a copy of the manifest and parses it. If it has recently acquired |
| that manifest, based on the monitoring interval from a previous retrieval, it |
| can skip this step and use the manifest from the earlier retrieval, which is |
| kept up-to-date by the background monitoring thread that started when it was |
| previously retrieved. Once it has the manifest, it then returns a `ClassLoader` |
| instance containing the resources defined in that manifest, first downloading |
| any missing resources and verifying them using the checksums in the manifest. |
| |
| `ClassLoader` instances are stored in a de-duplicating cache in memory with a |
| minimum lifetime of 24 hours. So, no two instances will ever exist in a process |
| for the same manifest. |
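
The de-duplication idea can be sketched with a `ConcurrentHashMap`, as below. This is not the factory's actual implementation: the class name and the choice of key are hypothetical, and the 24-hour minimum lifetime is deliberately left out to keep the sketch short.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

final class DedupCacheExample {

  // Keyed by some identity of the manifest contents (hypothetical choice:
  // a checksum string). Expiry of old entries is omitted in this sketch.
  private final Map<String, ClassLoader> cache = new ConcurrentHashMap<>();

  // Returns the cached instance for the key, creating it at most once even
  // when several threads ask for the same manifest concurrently.
  ClassLoader get(String manifestKey, Function<String, ClassLoader> creator) {
    return cache.computeIfAbsent(manifestKey, creator);
  }
}
```

Because `computeIfAbsent` runs the creator atomically per key, two callers asking for the same key receive the same `ClassLoader` instance.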
| |
| If this manifest had not previously been downloaded, a background monitoring |
| task is set up to ensure the URL is watched for any changes to the manifest. |
| This monitoring continues for as long as there exist `ClassLoader` instances |
| in the system that were constructed from the manifest at that URL (at least 24 |
| hours, since that is the minimum time they will exist in the de-duplicating |
| cache). |
| |
| ## Local Storage Cache |
| |
| The local storage cache location is configured by the user by setting the |
| required Accumulo property named `general.custom.classloader.ccl.cache.dir` |
| to a directory on the local file system. This location may be specified as an |
| absolute path or as a URL representing an absolute path with the `file` scheme. |
| The location and its directory structure will be created on first use if they |
| do not already exist. This will cause an error if the application does not |
| have permission to create the directories. |
| |
| The selected location should be a persistent location with plenty of space to |
| store downloaded resources, and should be writable by all the processes which |
| use this factory to share the same resources. You may wish to pre-create the |
| base directory specified by the property, and the three sub-directories, |
| `manifests`, `resources`, and `working`, to set the appropriate permissions and |
| ACLs. |
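
Pre-creating the layout could look like the following sketch. The class and method names are hypothetical; only the three sub-directory names come from this document. Permissions and ACLs are site-specific, so they are not set here.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;

final class CacheDirSetup {

  // Sub-directories named in this document.
  static final List<String> SUBDIRS = List.of("manifests", "resources", "working");

  // Creates the base directory and its sub-directories if they are missing.
  // Existing directories are left untouched, so this is safe to repeat.
  static void preCreate(Path baseDir) {
    try {
      Files.createDirectories(baseDir);
      for (String name : SUBDIRS) {
        Files.createDirectories(baseDir.resolve(name));
      }
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }
}
```

After running something like this, an administrator would still apply the ownership and permissions that let every participating process write to the directories.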
| |
| Resources downloaded to this cache may be used by multiple manifests, threads, |
| and processes, so be very careful when removing old contents to ensure that |
| they are no longer needed. If a resource file is deleted from the local storage |
| cache while a `ClassLoader` exists that references it, that `ClassLoader` may, |
| and probably will, stop working correctly. Similarly, files that have been |
| downloaded should not be modified, because any modification will likely cause |
| unexpected behavior to classloaders still using the file. |
| |
| * Do **NOT** use a temporary directory for the local storage cache location. |
| * The local storage cache location **MUST** use a file system that supports |
| atomic moves and hard links. |
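
A quick smoke test for the two file system requirements above could look like this sketch. It is a hypothetical helper, not something this project provides: it creates a probe file in the given directory, attempts an atomic move and a hard link, and removes its probe files afterwards.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

final class FileSystemProbe {

  // Returns true if both an atomic move and a hard link succeed inside dir.
  static boolean supportsAtomicMoveAndHardLinks(Path dir) {
    Path source = null;
    Path moved = null;
    Path link = null;
    try {
      source = Files.createTempFile(dir, "probe-", ".tmp");
      moved = dir.resolve(source.getFileName() + ".moved");
      link = dir.resolve(source.getFileName() + ".link");
      Files.move(source, moved, StandardCopyOption.ATOMIC_MOVE);
      Files.createLink(link, moved);
      return true;
    } catch (IOException | UnsupportedOperationException e) {
      // Covers AtomicMoveNotSupportedException and unsupported hard links.
      return false;
    } finally {
      deleteQuietly(link);
      deleteQuietly(moved);
      deleteQuietly(source);
    }
  }

  // Best-effort removal of a probe file.
  private static void deleteQuietly(Path path) {
    if (path == null) {
      return;
    }
    try {
      Files.deleteIfExists(path);
    } catch (IOException e) {
      // Ignored: the probe result matters more than the leftover file.
    }
  }
}
```

A `true` result only shows that the operations worked once in that directory; it does not replace checking the file system's documentation.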
| |
| ## Security |
| |
| The Accumulo property `general.custom.classloader.ccl.allowed.urls.pattern` |
| is another required parameter. It is used to limit the allowed URLs that can be |
| fetched when downloading manifests or resources. Since the process using this |
| factory will be using its own permissions to fetch resources, and placing a |
| copy of those resources in a local directory where others may access them, that |
| presents a potential file disclosure security risk. This property allows a |
| system administrator to mitigate that risk by restricting access to only |
| approved URLs (e.g., to exclude non-approved locations like |
| `file:/path/to/accumulo.properties` or |
| `hdfs://host/path/to/accumulo/rfile.rf`). |
| |
| An example value for this property might look like: |
| `https://example.com/path/to/manifests/.*` or |
| `(file:/etc|hdfs://example[.]com:9000)/path/to/manifests/.*` |
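
Before deploying a pattern, it can be tried against sample URLs with `java.util.regex`. The sketch below assumes the pattern must match the entire URL (`Matcher.matches()`); this document does not state how the factory applies the pattern, so treat that as an assumption. The class name is hypothetical.

```java
import java.util.regex.Pattern;

final class AllowedUrlsExample {

  private final Pattern allowed;

  AllowedUrlsExample(String regex) {
    this.allowed = Pattern.compile(regex);
  }

  // Assumption: the whole URL string must match, not just a substring of it.
  boolean isAllowed(String url) {
    return allowed.matcher(url).matches();
  }
}
```

With the second example value above, `hdfs://example.com:9000/path/to/manifests/ctx.json` is accepted under this assumption, while `file:/path/to/accumulo.properties` is rejected. Writing `[.]` rather than `.` keeps the dot in the host name from matching any character.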
| |
| Note: this property affects all URLs fetched by this factory, including |
| manifest URLs and any resource URLs defined inside any fetched manifest. It |
| should be updated by a system maintainer if any new manifests need to use |
| new locations. It may be updated on a running system, and will take effect |
| after approximately a minute. |
| |
| ## Creating a Manifest |
| |
| Users may take advantage of the `Manifest.create(String,int,String,URL[])` |
| method to construct a `Manifest` object programmatically. This will calculate |
| the checksums of the classpath elements. `Manifest.toJson()` can be used to |
| serialize the `Manifest` to a `String` to store in a file. |
| |
| Alternatively, if this library's jar is built and placed onto Accumulo's |
| `CLASSPATH`, then one can run `bin/accumulo create-classloader-manifest` to |
| create the manifest from the command line. The resulting JSON is printed to |
| stdout and can be redirected to a file. The command takes three arguments: |
| |
| 1. the monitor interval, in seconds (e.g. `-i 300`), |
| 2. an optional checksum algorithm to use (e.g. `-a 'SHA3-512'`), and |
| 3. a list of file URLs (e.g. `hdfs://host:port/path/to/one.jar |
| file:/host/path/to/two.jar`) |
| |
| ## Updating a Manifest |
| |
| This factory uses a background thread to fetch the manifest at its initial URL |
| at the interval specified in the manifest the last time it was retrieved. |
| The manifest at a watched URL can be replaced at any time with any changes, |
| including changes to the monitor interval and resources. When the manifest |
| is next retrieved, the new manifest will be used as though it were an entirely |
| new manifest at that URL. The next retrieval will occur after the monitor |
| interval read from the most recent retrieval elapses. Any change to the |
| context's resources results in the new resources being downloaded and |
| verified, and a new `ClassLoader` instance being created and made ready to be |
| returned the next time `getClassLoader(String context)` is called with that URL. |
| |
| Note: if the contents of a manifest change in only inconsequential ways, such |
| as JSON formatting changes, then those changes will not trigger any new |
| downloads or `ClassLoader` staging. This is because the manifest JSON is |
| normalized prior to computing its checksum to determine if any changes have |
| occurred. |
| |
| ## Error Handling |
| |
| If there is an exception in creating the initial `ClassLoader`, such as being |
| unable to retrieve the manifest at the specified URL, or if a resource does not |
| match its checksum in the manifest, then a `ContextClassLoaderException` is |
| thrown. If there is an exception when updating the classloader, then the |
| exception is logged and the classloader is not updated. Calls to |
| `getClassLoader(String context)` will return the most recent classloader with |
| valid contents. |
| |
| The property `general.custom.classloader.ccl.update.grace.minutes` determines |
| how long the update process continues to return the most recent valid |
| classloader when an exception occurs in the background update thread. A zero |
| value (default) will cause the most recent valid classloader to be returned |
| indefinitely. Otherwise, if a non-zero number is configured, then monitoring |
| will stop after the update has failed for that number of minutes. Once |
| monitoring has stopped, any subsequent calls to `getClassLoader(String context)` |
| will behave as they would during an initial request, throwing a |
| `ContextClassLoaderException` if the manifest cannot be retrieved or a |
| `ClassLoader` cannot be constructed at that time. |
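
The grace-period rule can be expressed as a small decision function. This is only a sketch of the behavior described above, not the factory's code: the names are hypothetical, and treating the boundary as inclusive (stopping once the elapsed time reaches the configured minutes) is an assumption.

```java
import java.time.Duration;
import java.time.Instant;

final class GracePeriodExample {

  // Decides whether monitoring should stop after repeated update failures.
  // graceMinutes mirrors general.custom.classloader.ccl.update.grace.minutes.
  static boolean shouldStopMonitoring(int graceMinutes, Instant firstFailure, Instant now) {
    if (graceMinutes <= 0) {
      // Zero (the default): keep serving the most recent valid classloader.
      return false;
    }
    Duration failingFor = Duration.between(firstFailure, now);
    // Assumption: stop once failures have lasted at least the grace period.
    return failingFor.compareTo(Duration.ofMinutes(graceMinutes)) >= 0;
  }
}
```

Under these assumptions, a grace period of 30 with updates failing for 31 minutes stops monitoring, while a grace period of 0 never does.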
| |
| ## Cleanup |
| |
| Because the cache directory is shared among multiple processes, and one process |
| can't know what the other processes are doing, this class cannot always clean |
| up the shared cache directory of unused resources automatically. It is left to |
| the user to remove unused files from the cache. The local storage is organized |
| into several directories, which are explained here to aid in understanding when |
| unused files can be safely removed. |
| |
| ### Manifests |
| |
| The `manifests` directory contents are always safe to delete, but doing so may |
| impair troubleshooting. This directory contains copies of any downloaded |
| manifests, but the behavior of the factory does not depend on the existence of |
| these files. |
| |
| ### Resources |
| |
| The `resources` directory contains a shared pool of remote resource files that |
| have been fetched for all manifests. The files in this directory are safe to |
| delete at any time. However, keep the following considerations in mind: |
| |
| 1. Deleting resources that are still needed will cause them to be downloaded |
| again the next time they are needed, which may cause an increase in network |
| activity. |
| 2. If any of the removed files had hard-linked "copies" in the `working` |
| directory, the newly downloaded copy will increase the total amount of |
| storage (whereas the original would have shared storage space with the |
| hard-linked "copies"). |
| |
| ### Working |
| |
| The `working` directory contains temporary files for files currently being |
| downloaded, and temporary directories containing hard-linked "copies" of files |
| from the `resources` directory. These files and directories contain the process |
| ID (PID) for the process that created them. Normally, these files are |
| automatically cleaned up, but if a process is killed before that can happen, |
| they may be left behind. The files with the PID in them can safely be removed, |
| so long as the process that created them has been terminated. |
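
Whether the creating process is still running can be checked with the JDK's `ProcessHandle` API, as in this sketch. The helper is hypothetical and takes the PID directly, because this document does not specify how the PID is embedded in the file names.

```java
final class PidCheckExample {

  // Returns true if a process with this PID is currently alive on this host.
  // A missing handle (no such process) is reported as not alive.
  static boolean isProcessAlive(long pid) {
    return ProcessHandle.of(pid).map(ProcessHandle::isAlive).orElse(false);
  }
}
```

Operating systems reuse PIDs, so a `true` result may refer to an unrelated newer process. A `false` result is the useful signal when deciding whether leftover files can be removed.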
| |
| This directory also contains files that do not contain a PID. These files end |
| with the `.downloading` suffix and exist to signal across processes that a |
| resource file is currently being downloaded by a process. These files are very |
| small, containing only the PID of the most recent process to attempt |
| downloading the file. They are removed when a download completes, or the next |
| time the corresponding resource file is used, if it has already been |
| successfully downloaded by a previously failed process. Removing them won't |
| break the application in any way, but doing so may result in a redundant |
| download, which can result in increased network activity or storage space (see |
| the previous section for considerations regarding the `resources` directory). |
| |
| [1]: https://docs.oracle.com/en/java/javase/11/docs/api/java.base/java/net/spi/URLStreamHandlerProvider.html |
| [2]: https://github.com/apache/accumulo-classloaders/tree/main/modules/hdfs-urlstreamhandler-provider |