#Introduction
There are multiple challenges in dataset configuration management in the context of ETL data processing, as ETL infrastructure employs multi-stage processing flows to ingest and publish data on HDFS. Here are some example types of datasets and types of processing:
A typical dataset could be a database table, a Kafka topic, etc. Currently, customization of dataset processing is typically achieved through file/directory blacklists/whitelists in job/flow-level configurations. This approach suffers from a number of issues:
We want a new way to customize the processing of each dataset, e.g., enabling/disabling certain types of processing, specific SLAs, access restrictions, retention policies, etc., without the previously mentioned problems.
#Dataset Config Management Requirement
Design a backend and a flexible client library for storing, managing, and accessing configuration that can be used to customize the processing of thousands of datasets across multiple systems/flows.
###Data Model
###Versioning
###Client library
###Config Store
#Current Dataset Config Management Implementation
At a very high level, we extend typesafe config as follows:
###Data model
Config key (configuration node) / config value
For our use cases, we can define each configuration node per dataset. All the configuration related to that dataset is specified together.
Essentially, the system provides a mapping from a config key to a config object. Each config key is represented by a URI. The config object is a map from property name to property value. We refer to this as the own config (object) and refer to it through the function own_config(K, property_name) = property_value.
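For illustration, a dataset's own config object could look like the following; the property names here are made up, since the actual keys depend on the consuming systems:

```
# Hypothetical own config for the config key /data/tracking/TOPIC
retention.days = 7
acl.restricted = false
sla.max.latency.minutes = 60
```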
A config key K can import one or more config keys I1, I2, … . The config key K will inherit any properties from I1, I2, … that are not defined in K. The inheritance is resolved in the order of the keys I1, I2, …, i.e., the property will be resolved to the value in the last imported key that defines it. This is similar to including configs in typesafe config. We refer to the resulting configuration as the resolved config (object) and denote it through the function resolved_config(K, property_name) = property_value.
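The own_config / resolved_config semantics above can be sketched with plain dictionaries; this is an illustrative model, not the actual client library, and all keys and values below are made up:

```python
def resolved_config(key, own, imports):
    """Resolve a config key against its own properties and its imports.

    own:     {config_key: {property_name: property_value}}
    imports: {config_key: [imported config keys, in resolution order]}
    """
    result = {}
    # Imports are resolved in order; among the imports, the last one
    # that defines a property wins.
    for imported in imports.get(key, []):
        result.update(resolved_config(imported, own, imports))
    # Properties defined on the key itself always take precedence.
    result.update(own.get(key, {}))
    return result

own = {
    "/data/tracking/TOPIC": {"retention.days": "7"},
    "/tags/retention/LONG": {"retention.days": "30", "retention.policy": "time"},
}
imports = {"/data/tracking/TOPIC": ["/tags/retention/LONG"]}

# The own value "7" overrides the imported "30"; retention.policy is inherited.
print(resolved_config("/data/tracking/TOPIC", own, imports))
```

Among the imports a later key overrides an earlier one, while K's own properties override everything inherited.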
We also use the path in the config key URI for implicit tagging. For example, /data/tracking/TOPIC implicitly imports /data/tracking/, which implicitly imports /data/, which implicitly imports /. Note that all of these URIs are themselves config keys, so the path levels implicitly indicate importation. For a given config key, all implicit imports come before the explicit imports, i.e., they have lower priority in resolution. A typical use case for this implicit importation is a global default configuration file at the root path that applies to all files under it; files under the root path can have their own settings that override the default values inherited from the root path's file.
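The implicit import chain is determined purely by the path of the config key URI. A minimal sketch (illustrative, not the library's implementation):

```python
def implicit_imports(config_key):
    """Return the implicit imports of a config key path, lowest priority
    first: the root, then each successively deeper ancestor."""
    parts = [p for p in config_key.strip("/").split("/") if p]
    chain = ["/"]
    for depth in range(1, len(parts)):
        chain.append("/" + "/".join(parts[:depth]) + "/")
    return chain

print(implicit_imports("/data/tracking/TOPIC"))
# ['/', '/data/', '/data/tracking/']
```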
###Tags
Static tags
For our use cases, we can define the static tags in a well-known file per dataset.
Dynamic tags
Some tags cannot be applied statically at “compile” time; for example, cluster-specific tags, since they depend on the environment where the client application runs. We will support such tags by allowing the use of a limited number of variables when importing another config key. For example, such a variable can be “local_cluster.name”. Then, importing /data/tracking/${local_cluster.name} can provide cluster-specific overrides.
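Substituting such variables into an import path at runtime can be sketched as follows; the function name and the cluster name are illustrative, not part of the actual library:

```python
import re

def expand_import(import_path, variables):
    """Expand ${variable} placeholders in an import path at runtime."""
    def replace(match):
        name = match.group(1)
        if name not in variables:
            raise KeyError("unknown config variable: " + name)
        return variables[name]
    return re.sub(r"\$\{([^}]+)\}", replace, import_path)

# "cluster1" is a made-up value resolved from the runtime environment.
print(expand_import("/data/tracking/${local_cluster.name}",
                    {"local_cluster.name": "cluster1"}))
# /data/tracking/cluster1
```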
###Config Store
The configuration is partitioned into a number of Config Stores. Each Config Store is:
###Client application
The client application interacts using the ConfigClient API. The ConfigClient maintains a set of ConfigStoreAccessor objects which interact, through the ConfigStore API, with the appropriate ConfigStore implementation depending on the scheme of the ConfigStore URI. There can be a native implementation of the API, like the HadoopFS ConfigStore, or an adapter to an existing config/metadata store, like the Hive MetaStore, etc.
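The scheme-based dispatch from ConfigClient to a ConfigStore can be sketched as below; the class and method names are illustrative, not the actual API, and the in-memory store stands in for a real implementation such as a HadoopFS ConfigStore:

```python
from urllib.parse import urlparse

class ConfigStore:
    """Illustrative ConfigStore interface."""
    def get_own_config(self, config_key):
        raise NotImplementedError

class InMemoryConfigStore(ConfigStore):
    """Stand-in for a native store implementation (e.g., HadoopFS-backed)."""
    def __init__(self, configs):
        self.configs = configs
    def get_own_config(self, config_key):
        return self.configs.get(config_key, {})

class ConfigClient:
    """Routes each request to a ConfigStore based on the URI scheme."""
    def __init__(self):
        self._stores = {}
    def register(self, scheme, store):
        self._stores[scheme] = store
    def get_own_config(self, uri):
        parsed = urlparse(uri)
        return self._stores[parsed.scheme].get_own_config(parsed.path)

client = ConfigClient()
client.register("memory",
                InMemoryConfigStore({"/data/tracking/TOPIC": {"retention.days": "7"}}))
print(client.get_own_config("memory:///data/tracking/TOPIC"))
```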
###File System layout
###Example of a config store
```
ROOT
├── _CONFIG_STORE   (contents = latest non-rolled-back version)
└── 1.0.53          (version directory)
    ├── data
    │   └── tracking
    │       ├── TOPIC
    │       │   ├── includes    (import links)
    │       │   └── main.conf   (configuration file)
    │       ├── includes
    │       └── main.conf
    └── tags
        ├── tracking
        │   └── retention
        │       ├── LONG
        │       │   ├── includes
        │       │   └── main.conf
        │       └── main.conf
        └── acl
            └── restricted
                ├── main.conf
                └── secdata
                    ├── includes
                    └── main.conf
```
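For example, the includes file under /data/tracking/TOPIC in this layout might list import links to tag config keys; the exact contents below are illustrative:

```
# Hypothetical contents of 1.0.53/data/tracking/TOPIC/includes
/tags/tracking/retention/LONG
/tags/acl/restricted
```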