This Apache Druid (incubating) module extends Druid Hadoop-based indexing to ingest data directly from offline Apache Parquet files.

Note: `druid-parquet-extensions` depends on the `druid-avro-extensions` module, so be sure to include both.
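For example, one way to include both is to add them to the extension load list in `common.runtime.properties`:

```
druid.extensions.loadList=["druid-avro-extensions", "druid-parquet-extensions"]
```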
This extension provides two ways to parse Parquet files:
- `parquet` - using a simple conversion contained within this extension
- `parquet-avro` - conversion to avro records with the parquet-avro library, using the `druid-avro-extensions` module to parse the avro data

Selection of conversion method is controlled by parser type, and the correct hadoop input format must also be set in the `ioConfig`: `org.apache.druid.data.input.parquet.DruidParquetInputFormat` for `parquet` and `org.apache.druid.data.input.parquet.DruidParquetAvroInputFormat` for `parquet-avro`.
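For instance, the `ioConfig` pairing for the `parquet` parser looks like this (a fragment; complete specs appear in the examples below):

```json
"ioConfig": {
  "type": "hadoop",
  "inputSpec": {
    "type": "static",
    "inputFormat": "org.apache.druid.data.input.parquet.DruidParquetInputFormat",
    "paths": "path/to/file.parquet"
  }
}
```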
Both parse options support auto field discovery and flattening if provided with a `flattenSpec` with the `parquet` or `avro` parseSpec format. Parquet nested list and map logical types should operate correctly with JSON path expressions for all supported types. `parquet-avro` sets a hadoop job property `parquet.avro.add-list-element-records` to `false` (which normally defaults to `true`), in order to 'unwrap' primitive list elements into multi-value dimensions.

The `parquet` parser supports `int96` Parquet values, while `parquet-avro` does not. There may also be some subtle differences in the behavior of JSON path expression evaluation of `flattenSpec`.

We suggest using `parquet` over `parquet-avro` to allow ingesting data beyond the schema constraints of Avro conversion. However, `parquet-avro` was the original basis for this extension, and as such it is a bit more mature.
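As an illustration, suppose a Parquet file contains records shaped like the following (hypothetical data, shown as JSON):

```json
{
  "timestamp": "2018-01-01T00:00:00Z",
  "dim1": "a",
  "nestedData": { "dim1": "b" },
  "listDim": ["c", "d"]
}
```

With the `flattenSpec` used in the examples below, `nestedDim` would be extracted via `$.nestedData.dim1` (value `b`) and `listDimFirstItem` via `$.listDim[0]` (value `c`), while `useFieldDiscovery` picks up the remaining top-level fields.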
| Field | Type | Description | Required |
|---|---|---|---|
| type | String | Choose `parquet` or `parquet-avro` to determine how Parquet files are parsed. | yes |
| parseSpec | JSON Object | Specifies the timestamp and dimensions of the data, and optionally a flatten spec. Valid parseSpec formats are `timeAndDims`, `parquet`, and `avro` (if used with avro conversion). | yes |
| binaryAsString | Boolean | Specifies whether a `bytes` Parquet column that is not logically marked as a string or enum type should be converted to a string anyway. | no (default == false) |
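For example, a `parquet` parser that decodes such binary columns as strings might be configured as follows (a minimal sketch; `binaryDim` is a hypothetical column name):

```json
"parser": {
  "type": "parquet",
  "binaryAsString": true,
  "parseSpec": {
    "format": "timeAndDims",
    "timestampSpec": { "column": "timestamp", "format": "auto" },
    "dimensionsSpec": { "dimensions": ["binaryDim"] }
  }
}
```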
When the time dimension is a `DateType` column, a format should not be supplied. When the format is UTF8 (`String`), either `auto` or an explicitly defined format is required.
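For example (column names are hypothetical), a `DateType` column needs no format, while a UTF8 column can use `auto` or an explicit pattern:

```json
"timestampSpec": { "column": "date_col" }
```

```json
"timestampSpec": { "column": "ts_col", "format": "yyyy-MM-dd HH:mm:ss" }
```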
`parquet` parser, `parquet` parseSpec:

```json
{
  "type": "index_hadoop",
  "spec": {
    "ioConfig": {
      "type": "hadoop",
      "inputSpec": {
        "type": "static",
        "inputFormat": "org.apache.druid.data.input.parquet.DruidParquetInputFormat",
        "paths": "path/to/file.parquet"
      },
      ...
    },
    "dataSchema": {
      "dataSource": "example",
      "parser": {
        "type": "parquet",
        "parseSpec": {
          "format": "parquet",
          "flattenSpec": {
            "useFieldDiscovery": true,
            "fields": [
              {
                "type": "path",
                "name": "nestedDim",
                "expr": "$.nestedData.dim1"
              },
              {
                "type": "path",
                "name": "listDimFirstItem",
                "expr": "$.listDim[0]"
              }
            ]
          },
          "timestampSpec": {
            "column": "timestamp",
            "format": "auto"
          },
          "dimensionsSpec": {
            "dimensions": [],
            "dimensionExclusions": [],
            "spatialDimensions": []
          }
        }
      },
      ...
    },
    "tuningConfig": <hadoop-tuning-config>
  }
}
```
`parquet` parser, `timeAndDims` parseSpec:

```json
{
  "type": "index_hadoop",
  "spec": {
    "ioConfig": {
      "type": "hadoop",
      "inputSpec": {
        "type": "static",
        "inputFormat": "org.apache.druid.data.input.parquet.DruidParquetInputFormat",
        "paths": "path/to/file.parquet"
      },
      ...
    },
    "dataSchema": {
      "dataSource": "example",
      "parser": {
        "type": "parquet",
        "parseSpec": {
          "format": "timeAndDims",
          "timestampSpec": {
            "column": "timestamp",
            "format": "auto"
          },
          "dimensionsSpec": {
            "dimensions": [
              "dim1",
              "dim2",
              "dim3",
              "listDim"
            ],
            "dimensionExclusions": [],
            "spatialDimensions": []
          }
        }
      },
      ...
    },
    "tuningConfig": <hadoop-tuning-config>
  }
}
```
`parquet-avro` parser, `avro` parseSpec:

```json
{
  "type": "index_hadoop",
  "spec": {
    "ioConfig": {
      "type": "hadoop",
      "inputSpec": {
        "type": "static",
        "inputFormat": "org.apache.druid.data.input.parquet.DruidParquetAvroInputFormat",
        "paths": "path/to/file.parquet"
      },
      ...
    },
    "dataSchema": {
      "dataSource": "example",
      "parser": {
        "type": "parquet-avro",
        "parseSpec": {
          "format": "avro",
          "flattenSpec": {
            "useFieldDiscovery": true,
            "fields": [
              {
                "type": "path",
                "name": "nestedDim",
                "expr": "$.nestedData.dim1"
              },
              {
                "type": "path",
                "name": "listDimFirstItem",
                "expr": "$.listDim[0]"
              }
            ]
          },
          "timestampSpec": {
            "column": "timestamp",
            "format": "auto"
          },
          "dimensionsSpec": {
            "dimensions": [],
            "dimensionExclusions": [],
            "spatialDimensions": []
          }
        }
      },
      ...
    },
    "tuningConfig": <hadoop-tuning-config>
  }
}
```
For additional details, see the Hadoop-based batch ingestion and general ingestion spec documentation.