The SDAP Collection Manager is a service that watches a YAML file (the Collections Configuration file) stored on the filesystem, along with every directory listed in that file. Whenever new granules are added to any of the watched directories, the Collection Manager publishes a message to RabbitMQ. The Granule Ingester (/granule_ingester in this repository) picks up that message and ingests the new granules.
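The watch-and-publish flow described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the Collection Manager's actual implementation: the function names `find_new_granules` and `watch_once`, the message layout, and the `publish` callback (which stands in for a RabbitMQ publisher) are all hypothetical, and the sketch covers local directories only.

```python
import os


def find_new_granules(directory, seen):
    """Return files in `directory` that are not yet in `seen`, and record them."""
    new_paths = []
    for name in sorted(os.listdir(directory)):
        path = os.path.join(directory, name)
        if os.path.isfile(path) and path not in seen:
            seen.add(path)
            new_paths.append(path)
    return new_paths


def watch_once(collections, seen, publish):
    """Scan every collection directory once and publish one message per new file.

    `publish` is a hypothetical callback standing in for the RabbitMQ publisher.
    The message layout below is an assumption made for this sketch.
    """
    for collection in collections:
        for path in find_new_granules(collection["path"], seen):
            publish({"collection": collection["id"], "granule": path})
```

Calling `watch_once` repeatedly with the same `seen` set reports each granule only once, which mirrors the idea that only newly added granules trigger a message.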
Python 3.7
Use a conda environment, for example:
$ conda create -n cmenv python=3.7
$ conda activate cmenv
From incubator-sdap-ingester, run:
$ cd common && python setup.py install
$ cd ../collection_manager && python setup.py install
From incubator-sdap-ingester, run:
$ python collection_manager/collection_manager/main.py -h
A path to a collections configuration file must be passed to the Collection Manager at startup via the --collections-path parameter. Below is an example of what the collections configuration file could look like:
# collections.yaml

collections:

    # The identifier for the dataset as it will appear in NEXUS.
  - id: "CSR-RL06-Mascons_LAND"

    # The path to watch for NetCDF granule files to be associated with this dataset.
    # This can also be an S3 path prefix, for example "s3://my-bucket/path/to/granules/"
    path: "/data/CSR-RL06-Mascons-land/"

    # An integer priority level to use when publishing messages to RabbitMQ for historical data.
    # Higher number = higher priority. Scale is 1-10.
    priority: 1

    # An integer priority level to use when publishing messages to RabbitMQ for forward-processing data.
    # Higher number = higher priority. Scale is 1-10.
    forward-processing-priority: 5

    # The type of projection to use when processing granules in this collection.
    # Accepted values are Grid, ECCO, TimeSeries, or Swath.
    projection: Grid

    dimensionNames:
      # The name of the primary variable
      variable: lwe_thickness

      # The name of the latitude variable
      latitude: lat

      # The name of the longitude variable
      longitude: lon

      # The name of the depth variable (only include if depth variable exists)
      depth: Z

      # The name of the time variable (only include if time variable exists)
      time: Time

    # This section is an index of each dimension on which the primary variable is dependent,
    # mapped to their desired slice sizes.
    slices:
      Z: 1
      Time: 1
      lat: 60
      lon: 60

  - id: ocean-bottom-pressure
    path: /data/OBP/
    priority: 6
    forward-processing-priority: 7
    projection: ECCO
    dimensionNames:
      latitude: YC
      longitude: XC
      time: time
      # "tile" is required when using the ECCO projection.
      # This refers to the name of the dimension containing the ECCO tile index.
      tile: tile
      variable: OBP
    slices:
      time: 1
      tile: 1
      i: 30
      j: 30
Note that the dimensions listed under slices will not necessarily match the values of the properties under dimensionNames. This is because sometimes the actual dimensions are referenced by index variables.
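The constraints stated in the comments of the example configuration (priorities on a 1-10 scale, four accepted projections, and a tile entry for ECCO) can be checked before starting the service. The sketch below is a hypothetical helper, not part of the Collection Manager; it assumes each collection entry has already been loaded into a plain dict (for example with PyYAML's `yaml.safe_load`), and the list of required keys is an assumption based on the example rather than a documented schema.

```python
ACCEPTED_PROJECTIONS = {"Grid", "ECCO", "TimeSeries", "Swath"}

# Assumption: these keys are treated as required because both example
# collections define them; the real service may be more lenient.
REQUIRED_KEYS = ("id", "path", "priority", "forward-processing-priority",
                 "projection", "dimensionNames", "slices")


def validate_collection(entry):
    """Return a list of problems found in one collection entry (empty if none)."""
    problems = []
    for key in REQUIRED_KEYS:
        if key not in entry:
            problems.append(f"missing key: {key}")

    # Priorities are integers on a 1-10 scale (booleans are rejected explicitly).
    for key in ("priority", "forward-processing-priority"):
        if key in entry:
            value = entry[key]
            is_int = isinstance(value, int) and not isinstance(value, bool)
            if not (is_int and 1 <= value <= 10):
                problems.append(f"{key} must be an integer from 1 to 10")

    projection = entry.get("projection")
    if "projection" in entry and projection not in ACCEPTED_PROJECTIONS:
        problems.append(f"unknown projection: {projection}")

    # "tile" is required under dimensionNames when using the ECCO projection.
    dimension_names = entry.get("dimensionNames", {})
    if projection == "ECCO" and "tile" not in dimension_names:
        problems.append("dimensionNames.tile is required for ECCO")
    return problems
```

A collection that returns an empty list passes these checks; anything else can be reported to the operator before the directories are watched.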
Tip: An easy way to determine which variables go under dimensionNames and which go under slices: the variables on which the primary variable depends should be listed under slices, and the variables on which those variables depend (which could be themselves, as in the first collection in the above example) should be the values of the properties under dimensionNames. The exception is dimensionNames.variable, which should always be the name of the primary variable.
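To make the role of slices concrete, the sketch below checks that the keys under slices match the dimensions of the primary variable and estimates how many slices a granule would be cut into. Both helper functions are hypothetical, the dimension lengths must be supplied by the caller (for example after inspecting the NetCDF file), and the assumption that a partial slice at the end of a dimension still counts as one slice is ours; the source does not say how remainders are handled.

```python
import math


def check_slices(variable_dimensions, slices):
    """Compare the primary variable's dimensions with the keys under `slices`.

    Returns (missing, extra): dimensions without a slice size, and slice keys
    that are not dimensions of the primary variable.
    """
    missing = sorted(set(variable_dimensions) - set(slices))
    extra = sorted(set(slices) - set(variable_dimensions))
    return missing, extra


def count_slices(dimension_lengths, slices):
    """Estimate the number of slices per granule.

    Assumption: each dimension is cut into ceil(length / slice size) pieces,
    so a trailing partial slice counts as one slice.
    """
    total = 1
    for name, size in slices.items():
        total *= math.ceil(dimension_lengths[name] / size)
    return total
```

For the second collection in the example, the primary variable OBP would be expected to depend on time, tile, i and j, so `check_slices(("time", "tile", "j", "i"), slices)` returns two empty lists.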
From incubator-sdap-ingester/, run:
$ cd common && python setup.py install
$ cd ../collection_manager && python setup.py install
$ pip install pytest && pytest
From incubator-sdap-ingester, run:
$ docker build . -f collection_manager/docker/Dockerfile -t nexusjpl/collection-manager