| --- |
| id: multi-value-dimensions |
| title: "Multi-value dimensions" |
| --- |
| |
| <!-- |
| ~ Licensed to the Apache Software Foundation (ASF) under one |
| ~ or more contributor license agreements. See the NOTICE file |
| ~ distributed with this work for additional information |
| ~ regarding copyright ownership. The ASF licenses this file |
| ~ to you under the Apache License, Version 2.0 (the |
| ~ "License"); you may not use this file except in compliance |
| ~ with the License. You may obtain a copy of the License at |
| ~ |
| ~ http://www.apache.org/licenses/LICENSE-2.0 |
| ~ |
| ~ Unless required by applicable law or agreed to in writing, |
| ~ software distributed under the License is distributed on an |
| ~ "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY |
| ~ KIND, either express or implied. See the License for the |
| ~ specific language governing permissions and limitations |
| ~ under the License. |
| --> |
| |
| |
| Apache Druid supports "multi-value" string dimensions. Multi-value string dimensions result from input fields that contain an |
| array of values instead of a single value, such as the `tags` values in the following JSON array example: |
| |
| ``` |
| {"timestamp": "2011-01-12T00:00:00.000Z", "tags": ["t1","t2","t3"]} |
| ``` |
| |
| It is important to be aware that multi-value dimensions are distinct from [array types](arrays.md). While array types behave like standard SQL arrays, multi-value dimensions do not. This document describes the behavior of multi-value dimensions, and some additional details can be found in the [SQL data type documentation](sql-data-types.md#multi-value-strings-behavior). |
| |
| This document describes inserting, filtering, and grouping behavior for multi-value dimensions. For information about the internal representation of multi-value dimensions, see |
| [segments documentation](../design/segments.md#multi-value-columns). Examples in this document |
| are in the form of both [SQL](sql.md) and [native Druid queries](querying.md). Refer to the [Druid SQL documentation](sql-multivalue-string-functions.md) for details |
| about the functions available for using multi-value string dimensions in SQL. |
| |
| The following sections describe inserting, filtering, and grouping behavior based on the following example data, which includes a multi-value dimension, `tags`. |
| |
| ```json lines |
| {"timestamp": "2011-01-12T00:00:00.000Z", "label": "row1", "tags": ["t1","t2","t3"]} |
| {"timestamp": "2011-01-13T00:00:00.000Z", "label": "row2", "tags": ["t3","t4","t5"]} |
| {"timestamp": "2011-01-14T00:00:00.000Z", "label": "row3", "tags": ["t5","t6","t7"]} |
| {"timestamp": "2011-01-14T00:00:00.000Z", "label": "row4", "tags": []} |
| ``` |
| |
| ## Ingestion |
| |
| ### Native batch and streaming ingestion |
| When using native [batch](../ingestion/native-batch.md) or streaming ingestion such as with [Apache Kafka](../ingestion/kafka-ingestion.md), the Druid web console data loader can detect multi-value dimensions and configure the `dimensionsSpec` accordingly. |
| |
| For TSV or CSV data, you can specify the multi-value delimiters using the `listDelimiter` field in the `inputFormat`. JSON data must be formatted as a JSON array to be ingested as a multi-value dimension. JSON data does not require `inputFormat` configuration. |
| |
| The following shows an example `dimensionsSpec` for native ingestion of the data used in this document: |
| |
| ``` |
| "dimensions": [ |
| { |
| "type": "string", |
| "name": "label" |
| }, |
| { |
| "type": "string", |
| "name": "tags", |
| "multiValueHandling": "SORTED_ARRAY", |
| "createBitmapIndex": true |
| } |
| ], |
| ``` |
| |
| By default, Druid sorts values in multi-value dimensions. This behavior is controlled by the `SORTED_ARRAY` value of the `multiValueHandling` field. Alternatively, you can specify multi-value handling as: |
| |
| * `SORTED_SET`: results in the removal of duplicate values |
| * `ARRAY`: retains the original order of the values |
| |
| See [Dimension Objects](../ingestion/ingestion-spec.md#dimension-objects) for information on configuring multi-value handling. |
| |
| ### SQL-based ingestion |
| Multi-value dimensions can also be inserted with [SQL-based ingestion](../multi-stage-query/index.md). The functions `MV_TO_ARRAY` and `ARRAY_TO_MV` can assist in converting `VARCHAR` to `VARCHAR ARRAY` and `VARCHAR ARRAY` into `VARCHAR` respectively. `multiValueHandling` is not available when using the multi-stage query engine to insert data. |
| |
| For example, to insert the data used in this document: |
| ```sql |
| REPLACE INTO "mvd_example" OVERWRITE ALL |
| WITH "ext" AS ( |
| SELECT * |
| FROM TABLE( |
| EXTERN( |
| '{"type":"inline","data":"{\"timestamp\": \"2011-01-12T00:00:00.000Z\", \"label\": \"row1\", \"tags\": [\"t1\",\"t2\",\"t3\"]}\n{\"timestamp\": \"2011-01-13T00:00:00.000Z\", \"label\": \"row2\", \"tags\": [\"t3\",\"t4\",\"t5\"]}\n{\"timestamp\": \"2011-01-14T00:00:00.000Z\", \"label\": \"row3\", \"tags\": [\"t5\",\"t6\",\"t7\"]}\n{\"timestamp\": \"2011-01-14T00:00:00.000Z\", \"label\": \"row4\", \"tags\": []}"}', |
| '{"type":"json"}', |
| '[{"name":"timestamp", "type":"STRING"},{"name":"label", "type":"STRING"},{"name":"tags", "type":"ARRAY<STRING>"}]' |
| ) |
| ) |
| ) |
| SELECT |
| TIME_PARSE("timestamp") AS "__time", |
| "label", |
| ARRAY_TO_MV("tags") AS "tags" |
| FROM "ext" |
| PARTITIONED BY DAY |
| ``` |
| |
| ### SQL-based ingestion with rollup |
| These input arrays can also be grouped prior to converting into a multi-value dimension: |
| ```sql |
| REPLACE INTO "mvd_example_rollup" OVERWRITE ALL |
| WITH "ext" AS ( |
| SELECT * |
| FROM TABLE( |
| EXTERN( |
| '{"type":"inline","data":"{\"timestamp\": \"2011-01-12T00:00:00.000Z\", \"label\": \"row1\", \"tags\": [\"t1\",\"t2\",\"t3\"]}\n{\"timestamp\": \"2011-01-13T00:00:00.000Z\", \"label\": \"row2\", \"tags\": [\"t3\",\"t4\",\"t5\"]}\n{\"timestamp\": \"2011-01-14T00:00:00.000Z\", \"label\": \"row3\", \"tags\": [\"t5\",\"t6\",\"t7\"]}\n{\"timestamp\": \"2011-01-14T00:00:00.000Z\", \"label\": \"row4\", \"tags\": []}"}', |
| '{"type":"json"}', |
| '[{"name":"timestamp", "type":"STRING"},{"name":"label", "type":"STRING"},{"name":"tags", "type":"ARRAY<STRING>"}]' |
| ) |
| ) |
| ) |
| SELECT |
| TIME_PARSE("timestamp") AS "__time", |
| "label", |
| ARRAY_TO_MV("tags") AS "tags", |
| COUNT(*) AS "count" |
| FROM "ext" |
| GROUP BY 1, 2, "tags" |
| PARTITIONED BY DAY |
| ``` |
| |
| Notice that `ARRAY_TO_MV` is not present in the `GROUP BY` clause since we only wish to coerce the type _after_ grouping. |
| |
| |
| The `EXTERN` is also able to refer to the `tags` input type as `VARCHAR`, which is also how a query on a Druid table containing a multi-value dimension would specify the type of the `tags` column. If this is the case you must use `MV_TO_ARRAY` since the multi-stage query engine only supports grouping on multi-value dimensions as arrays. So, they must be coerced first. These arrays must then be coerced back into `VARCHAR` in the `SELECT` part of the statement with `ARRAY_TO_MV`. |
| |
| ```sql |
| REPLACE INTO "mvd_example_rollup" OVERWRITE ALL |
| WITH "ext" AS ( |
| SELECT * |
| FROM TABLE( |
| EXTERN( |
| '{"type":"inline","data":"{\"timestamp\": \"2011-01-12T00:00:00.000Z\", \"label\": \"row1\", \"tags\": [\"t1\",\"t2\",\"t3\"]}\n{\"timestamp\": \"2011-01-13T00:00:00.000Z\", \"label\": \"row2\", \"tags\": [\"t3\",\"t4\",\"t5\"]}\n{\"timestamp\": \"2011-01-14T00:00:00.000Z\", \"label\": \"row3\", \"tags\": [\"t5\",\"t6\",\"t7\"]}\n{\"timestamp\": \"2011-01-14T00:00:00.000Z\", \"label\": \"row4\", \"tags\": []}"}', |
| '{"type":"json"}' |
| ) |
| ) EXTEND ("timestamp" VARCHAR, "label" VARCHAR, "tags" VARCHAR) |
| ) |
| SELECT |
| TIME_PARSE("timestamp") AS "__time", |
| "label", |
| ARRAY_TO_MV(MV_TO_ARRAY("tags")) AS "tags", |
| COUNT(*) AS "count" |
| FROM "ext" |
| GROUP BY 1, 2, MV_TO_ARRAY("tags") |
| PARTITIONED BY DAY |
| ``` |
| |
| ## Querying multi-value dimensions |
| |
| ### Filtering |
| |
| All query types, as well as [filtered aggregators](aggregations.md#filtered-aggregator), can filter on multi-value |
| dimensions. Filters follow these rules on multi-value dimensions: |
| |
| - Value filters (like "selector", "bound", and "in") match a row if any of the values of a multi-value dimension match |
| the filter. |
| - The Column Comparison filter will match a row if the dimensions have any overlap. |
| - Value filters that match `null` or `""` (empty string) will match empty cells in a multi-value dimension. |
| - Logical expression filters behave the same way they do on single-value dimensions: "and" matches a row if all |
| underlying filters match that row; "or" matches a row if any underlying filters match that row; "not" matches a row |
| if the underlying filter does not match the row. |
| |
| The following example illustrates these rules. This query applies an "or" filter to match row1 and row2 of the dataset above, but not row3: |
| |
| ```sql |
| SELECT * |
| FROM "mvd_example_rollup" |
| WHERE tags = 't1' OR tags = 't3' |
| ``` |
| |
| returns |
| ```json lines |
| {"__time":"2011-01-12T00:00:00.000Z","label":"row1","tags":"[\"t1\",\"t2\",\"t3\"]","count":1} |
| {"__time":"2011-01-13T00:00:00.000Z","label":"row2","tags":"[\"t3\",\"t4\",\"t5\"]","count":1} |
| ``` |
| |
| Native queries can also perform filtering that would be considered a "contradiction" in SQL, such as this "and" filter which would match only row1 of the dataset above: |
| |
| ``` |
| { |
| "type": "and", |
| "fields": [ |
| { |
| "type": "selector", |
| "dimension": "tags", |
| "value": "t1" |
| }, |
| { |
| "type": "selector", |
| "dimension": "tags", |
| "value": "t3" |
| } |
| ] |
| } |
| ``` |
| |
| which returns |
| ```json lines |
| {"__time":"2011-01-12T00:00:00.000Z","label":"row1","tags":"[\"t1\",\"t2\",\"t3\"]","count":1} |
| ``` |
| |
| Multi-value dimensions also consider an empty row as `null`, consider: |
| ```sql |
| SELECT * |
| FROM "mvd_example_rollup" |
| WHERE tags is null |
| ``` |
| |
| which results in: |
| ```json lines |
| {"__time":"2011-01-14T00:00:00.000Z","label":"row4","tags":null,"count":1} |
| ``` |
| |
| ### Grouping |
| |
| When grouping on a multi-value dimension with SQL or a native [topN](topnquery.md) or [groupBy](groupbyquery.md) queries, _all_ values |
| from matching rows will be used to generate one group per value. This behaves similarly to an implicit SQL `UNNEST` |
| operation. This means it's possible for a query to return more groups than there are rows. For example, a topN on the |
| dimension `tags` with filter `"t1" AND "t3"` would match only row1, and generate a result with three groups: |
| `t1`, `t2`, and `t3`. |
| |
| If you only need to include values that match your filter, you can use the SQL functions [`MV_FILTER_ONLY`/`MV_FILTER_NONE`](sql-multivalue-string-functions.md), |
| [filtered virtual column](virtual-columns.md#list-filtered-virtual-column), or [filtered dimensionSpec](dimensionspecs.md#filtered-dimensionspecs). This can also improve performance. |
| |
| #### Example: SQL grouping query with no filtering |
| ```sql |
| SELECT label, tags |
| FROM "mvd_example_rollup" |
| GROUP BY 1,2 |
| ``` |
| results in: |
| ```json lines |
| {"label":"row1","tags":"t1"} |
| {"label":"row1","tags":"t2"} |
| {"label":"row1","tags":"t3"} |
| {"label":"row2","tags":"t3"} |
| {"label":"row2","tags":"t4"} |
| {"label":"row2","tags":"t5"} |
| {"label":"row3","tags":"t5"} |
| {"label":"row3","tags":"t6"} |
| {"label":"row3","tags":"t7"} |
| {"label":"row4","tags":null} |
| ``` |
| |
| #### Example: SQL grouping query with a filter |
| ```sql |
| SELECT label, tags |
| FROM "mvd_example_rollup" |
| WHERE label in ('row1','row2') |
| GROUP BY 1,2 |
| ``` |
| |
| results: |
| ```json lines |
| {"label":"row1","tags":"t1"} |
| {"label":"row1","tags":"t2"} |
| {"label":"row1","tags":"t3"} |
| {"label":"row2","tags":"t3"} |
| {"label":"row2","tags":"t4"} |
| {"label":"row2","tags":"t5"} |
| ``` |
| |
| #### Example: native GroupBy query with no filtering |
| |
| See [GroupBy querying](groupbyquery.md) for details. |
| |
| ```json |
| { |
| "queryType": "groupBy", |
| "dataSource": "test", |
| "intervals": [ |
| "1970-01-01T00:00:00.000Z/3000-01-01T00:00:00.000Z" |
| ], |
| "granularity": { |
| "type": "all" |
| }, |
| "dimensions": [ |
| { |
| "type": "default", |
| "dimension": "tags", |
| "outputName": "tags" |
| } |
| ], |
| "aggregations": [ |
| { |
| "type": "count", |
| "name": "count" |
| } |
| ] |
| } |
| ``` |
| |
| This query returns the following result: |
| |
| ```json |
| [ |
| { |
| "timestamp": "1970-01-01T00:00:00.000Z", |
| "event": { |
| "count": 1, |
| "tags": "t1" |
| } |
| }, |
| { |
| "timestamp": "1970-01-01T00:00:00.000Z", |
| "event": { |
| "count": 1, |
| "tags": "t2" |
| } |
| }, |
| { |
| "timestamp": "1970-01-01T00:00:00.000Z", |
| "event": { |
| "count": 2, |
| "tags": "t3" |
| } |
| }, |
| { |
| "timestamp": "1970-01-01T00:00:00.000Z", |
| "event": { |
| "count": 1, |
| "tags": "t4" |
| } |
| }, |
| { |
| "timestamp": "1970-01-01T00:00:00.000Z", |
| "event": { |
| "count": 2, |
| "tags": "t5" |
| } |
| }, |
| { |
| "timestamp": "1970-01-01T00:00:00.000Z", |
| "event": { |
| "count": 1, |
| "tags": "t6" |
| } |
| }, |
| { |
| "timestamp": "1970-01-01T00:00:00.000Z", |
| "event": { |
| "count": 1, |
| "tags": "t7" |
| } |
| } |
| ] |
| ``` |
| |
| Notice that original rows are "exploded" into multiple rows and merged. |
| |
| #### Example: native GroupBy query with a selector query filter |
| |
| See [query filters](filters.md) for details of selector query filter. |
| |
| ```json |
| { |
| "queryType": "groupBy", |
| "dataSource": "test", |
| "intervals": [ |
| "1970-01-01T00:00:00.000Z/3000-01-01T00:00:00.000Z" |
| ], |
| "filter": { |
| "type": "selector", |
| "dimension": "tags", |
| "value": "t3" |
| }, |
| "granularity": { |
| "type": "all" |
| }, |
| "dimensions": [ |
| { |
| "type": "default", |
| "dimension": "tags", |
| "outputName": "tags" |
| } |
| ], |
| "aggregations": [ |
| { |
| "type": "count", |
| "name": "count" |
| } |
| ] |
| } |
| ``` |
| |
| This query returns the following result: |
| |
| ```json |
| [ |
| { |
| "timestamp": "1970-01-01T00:00:00.000Z", |
| "event": { |
| "count": 1, |
| "tags": "t1" |
| } |
| }, |
| { |
| "timestamp": "1970-01-01T00:00:00.000Z", |
| "event": { |
| "count": 1, |
| "tags": "t2" |
| } |
| }, |
| { |
| "timestamp": "1970-01-01T00:00:00.000Z", |
| "event": { |
| "count": 2, |
| "tags": "t3" |
| } |
| }, |
| { |
| "timestamp": "1970-01-01T00:00:00.000Z", |
| "event": { |
| "count": 1, |
| "tags": "t4" |
| } |
| }, |
| { |
| "timestamp": "1970-01-01T00:00:00.000Z", |
| "event": { |
| "count": 1, |
| "tags": "t5" |
| } |
| } |
| ] |
| ``` |
| |
| You might be surprised to see "t1", "t2", "t4" and "t5" included in the results. This is because the query filter is |
| applied on the row before explosion. For multi-value dimensions, a filter for value "t3" would match row1 and row2, |
| after which exploding is done. For multi-value dimensions, a query filter matches a row if any individual value inside |
| the multiple values matches the query filter. |
| |
| #### Example: native GroupBy query with selector query and dimension filters |
| |
| To solve the problem above and to get only rows for "t3", use a "filtered dimension spec", as in the query below. |
| |
| See filtered `dimensionSpecs` in [dimensionSpecs](dimensionspecs.md#filtered-dimensionspecs) for details. |
| |
| ```json |
| { |
| "queryType": "groupBy", |
| "dataSource": "test", |
| "intervals": [ |
| "1970-01-01T00:00:00.000Z/3000-01-01T00:00:00.000Z" |
| ], |
| "filter": { |
| "type": "selector", |
| "dimension": "tags", |
| "value": "t3" |
| }, |
| "granularity": { |
| "type": "all" |
| }, |
| "dimensions": [ |
| { |
| "type": "listFiltered", |
| "delegate": { |
| "type": "default", |
| "dimension": "tags", |
| "outputName": "tags" |
| }, |
| "values": ["t3"] |
| } |
| ], |
| "aggregations": [ |
| { |
| "type": "count", |
| "name": "count" |
| } |
| ] |
| } |
| ``` |
| |
| This query returns the following result: |
| |
| ```json |
| [ |
| { |
| "timestamp": "1970-01-01T00:00:00.000Z", |
| "event": { |
| "count": 2, |
| "tags": "t3" |
| } |
| } |
| ] |
| ``` |
| |
| Note that, for groupBy queries, you could get similar result with a [having spec](having.md) but using a filtered |
| `dimensionSpec` is much more efficient because that gets applied at the lowest level in the query processing pipeline. |
| Having specs are applied at the outermost level of groupBy query processing. |
| |
| ## Disable GroupBy on multi-value columns |
| |
| You can disable the implicit unnesting behavior for groupBy by setting `groupByEnableMultiValueUnnesting: false` in your |
| [query context](query-context-reference.md). In this mode, the groupBy engine will return an error instead of completing the query. This is a safety |
| feature for situations where you believe that all dimensions are singly-valued and want the engine to reject any |
| multi-valued dimensions that were inadvertently included. |
| |
| ## Differences between arrays and multi-value dimensions |
| Avoid confusing string arrays with [multi-value dimensions](multi-value-dimensions.md). Arrays and multi-value dimensions are stored in different column types, and query behavior is different. You can use the functions `MV_TO_ARRAY` and `ARRAY_TO_MV` to convert between the two if needed. In general, we recommend using arrays whenever possible, since they are a newer and more powerful feature and have SQL compliant behavior. |
| |
| Use care during ingestion to ensure you get the type you want. |
| |
| To get arrays when performing an ingestion using JSON ingestion specs, such as [native batch](../ingestion/native-batch.md) or streaming ingestion such as with [Apache Kafka](../ingestion/kafka-ingestion.md), use dimension type `auto` or enable `useSchemaDiscovery`. When performing a [SQL-based ingestion](../multi-stage-query/index.md), write a query that generates arrays. Arrays may contain strings or numbers. |
| |
| To get multi-value dimensions when performing an ingestion using JSON ingestion specs, use dimension type `string` and do not enable `useSchemaDiscovery`. When performing a [SQL-based ingestion](../multi-stage-query/index.md), wrap arrays in [`ARRAY_TO_MV`](multi-value-dimensions.md#sql-based-ingestion). Multi-value dimensions can only contain strings. |
| |
| You can tell which type you have by checking the `INFORMATION_SCHEMA.COLUMNS` table, using a query like: |
| |
| ```sql |
| SELECT COLUMN_NAME, DATA_TYPE |
| FROM INFORMATION_SCHEMA.COLUMNS |
| WHERE TABLE_NAME = 'mytable' |
| ``` |
| |
| Arrays are type `ARRAY`, multi-value strings are type `VARCHAR`. |