This Apache Druid (incubating) module extends Druid Hadoop-based indexing to ingest data directly from offline Apache Parquet files.

Note: `druid-parquet-extensions` depends on the `druid-avro-extensions` module, so be sure to include both.
This extension provides two ways to parse Parquet files:

- `parquet` - using a simple conversion contained within this extension
- `parquet-avro` - conversion to Avro records with the `parquet-avro` library, then using the `druid-avro-extensions` module to parse the Avro data

Selection of conversion method is controlled by parser type, and the correct Hadoop input format must also be set in the `ioConfig`:

- `org.apache.druid.data.input.parquet.DruidParquetInputFormat` for `parquet`
- `org.apache.druid.data.input.parquet.DruidParquetAvroInputFormat` for `parquet-avro`
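For instance, an `ioConfig` fragment selecting the `parquet` input format might look like the following sketch (the `paths` value is a placeholder):

```json
"ioConfig": {
  "type": "hadoop",
  "inputSpec": {
    "type": "static",
    "inputFormat": "org.apache.druid.data.input.parquet.DruidParquetInputFormat",
    "paths": "path/to/file.parquet"
  }
}
```

To use `parquet-avro` instead, swap in `DruidParquetAvroInputFormat` and set the parser type accordingly.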
Both parse options support auto field discovery and flattening if provided with a `flattenSpec` with `parquet` or `avro` as the format. Parquet nested list and map logical types should operate correctly with JSON path expressions for all supported types. `parquet-avro` sets the Hadoop job property `parquet.avro.add-list-element-records` to `false` (which normally defaults to `true`), in order to 'unwrap' primitive list elements into multi-value dimensions.
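As an illustrative sketch (the field names here are hypothetical), a `flattenSpec` that extracts a value from a nested map and the first element of a list might look like:

```json
"flattenSpec": {
  "useFieldDiscovery": true,
  "fields": [
    { "type": "path", "name": "nestedDim", "expr": "$.nestedData.dim1" },
    { "type": "path", "name": "listFirstItem", "expr": "$.listDim[0]" }
  ]
}
```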
The `parquet` parser supports `int96` Parquet values, while `parquet-avro` does not. There may also be some subtle differences in the behavior of JSON path expression evaluation of the `flattenSpec`.
We suggest using `parquet` over `parquet-avro` to allow ingesting data beyond the schema constraints of Avro conversion. However, `parquet-avro` was the original basis for this extension, and as such it is a bit more mature.
| Field | Type | Description | Required |
|-------|------|-------------|----------|
| `type` | String | Choose `parquet` or `parquet-avro` to determine how Parquet files are parsed. | yes |
| `parseSpec` | JSON Object | Specifies the timestamp and dimensions of the data, and optionally, a flatten spec. Valid `parseSpec` formats are `timeAndDims`, `parquet`, and `avro` (if used with Avro conversion). | yes |
| `binaryAsString` | Boolean | Specifies whether Parquet `bytes` columns that are not logically marked as string or enum types should be converted to strings anyway. | no (default == false) |
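For example, a parser that coerces un-annotated binary columns to strings could be configured as sketched below (the column and dimension names are hypothetical):

```json
"parser": {
  "type": "parquet",
  "binaryAsString": true,
  "parseSpec": {
    "format": "timeAndDims",
    "timestampSpec": { "column": "timestamp", "format": "auto" },
    "dimensionsSpec": { "dimensions": ["dim1", "dim2"] }
  }
}
```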
When the time dimension is a DateType column, a format should not be supplied. When the format is UTF8 (String), either `auto` or an explicitly defined format is required.
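As a sketch, for a UTF8 column holding ISO-8601 timestamps you would supply a format:

```json
"timestampSpec": { "column": "timestamp", "format": "iso" }
```

whereas for a DateType column (here the hypothetical name `date_col`) only the column name is needed:

```json
"timestampSpec": { "column": "date_col" }
```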
#### `parquet` parser, `parquet` parseSpec

```json
{
  "type": "index_hadoop",
  "spec": {
    "ioConfig": {
      "type": "hadoop",
      "inputSpec": {
        "type": "static",
        "inputFormat": "org.apache.druid.data.input.parquet.DruidParquetInputFormat",
        "paths": "path/to/file.parquet"
      },
      ...
    },
    "dataSchema": {
      "dataSource": "example",
      "parser": {
        "type": "parquet",
        "parseSpec": {
          "format": "parquet",
          "flattenSpec": {
            "useFieldDiscovery": true,
            "fields": [
              {
                "type": "path",
                "name": "nestedDim",
                "expr": "$.nestedData.dim1"
              },
              {
                "type": "path",
                "name": "listDimFirstItem",
                "expr": "$.listDim[1]"
              }
            ]
          },
          "timestampSpec": {
            "column": "timestamp",
            "format": "auto"
          },
          "dimensionsSpec": {
            "dimensions": [],
            "dimensionExclusions": [],
            "spatialDimensions": []
          }
        }
      },
      ...
    },
    "tuningConfig": <hadoop-tuning-config>
  }
}
```
#### `parquet` parser, `timeAndDims` parseSpec

```json
{
  "type": "index_hadoop",
  "spec": {
    "ioConfig": {
      "type": "hadoop",
      "inputSpec": {
        "type": "static",
        "inputFormat": "org.apache.druid.data.input.parquet.DruidParquetInputFormat",
        "paths": "path/to/file.parquet"
      },
      ...
    },
    "dataSchema": {
      "dataSource": "example",
      "parser": {
        "type": "parquet",
        "parseSpec": {
          "format": "timeAndDims",
          "timestampSpec": {
            "column": "timestamp",
            "format": "auto"
          },
          "dimensionsSpec": {
            "dimensions": [
              "dim1",
              "dim2",
              "dim3",
              "listDim"
            ],
            "dimensionExclusions": [],
            "spatialDimensions": []
          }
        }
      },
      ...
    },
    "tuningConfig": <hadoop-tuning-config>
  }
}
```
#### `parquet-avro` parser, `avro` parseSpec

```json
{
  "type": "index_hadoop",
  "spec": {
    "ioConfig": {
      "type": "hadoop",
      "inputSpec": {
        "type": "static",
        "inputFormat": "org.apache.druid.data.input.parquet.DruidParquetAvroInputFormat",
        "paths": "path/to/file.parquet"
      },
      ...
    },
    "dataSchema": {
      "dataSource": "example",
      "parser": {
        "type": "parquet-avro",
        "parseSpec": {
          "format": "avro",
          "flattenSpec": {
            "useFieldDiscovery": true,
            "fields": [
              {
                "type": "path",
                "name": "nestedDim",
                "expr": "$.nestedData.dim1"
              },
              {
                "type": "path",
                "name": "listDimFirstItem",
                "expr": "$.listDim[1]"
              }
            ]
          },
          "timestampSpec": {
            "column": "timestamp",
            "format": "auto"
          },
          "dimensionsSpec": {
            "dimensions": [],
            "dimensionExclusions": [],
            "spatialDimensions": []
          }
        }
      },
      ...
    },
    "tuningConfig": <hadoop-tuning-config>
  }
}
```
For additional details, see the Hadoop ingestion and general ingestion spec documentation.