blob: 5ff8a59a705004215b2bf61f218b9b4ba07a64f7 [file] [view]
---
{
"title": "Multi-Dimensional Aggregation Analysis",
"language": "en",
"description": "How can you use ROLLUP, CUBE, and GROUPING SETS to perform multi-dimensional aggregation analysis in a single SQL statement? This article presents the syntax, examples, and a comparison of suitable scenarios.",
"keywords": [
"ROLLUP",
"CUBE",
"GROUPING SETS",
"GROUPING function",
"GROUPING_ID",
"multi-dimensional aggregation",
"GROUP BY extension",
"subtotals and grand total",
"hierarchical summary"
]
}
---
<!-- Knowledge type: Feature description + SQL syntax -->
<!-- Applicable scenario: Multi-dimensional reports, hierarchical summarization, cross-dimensional analysis -->
When producing sales reports, operational analyses, or market surveys, you often need a single query that returns both the "subtotals at the detail dimension" and the "totals across different dimension combinations" along with the "grand total." Stitching together multiple `GROUP BY` queries with `UNION ALL` not only makes the SQL verbose but also scans the base table multiple times, which is inefficient.
Doris provides three syntactic extensions to the `GROUP BY` clause for multi-dimensional aggregation. They produce summary results at multiple levels in a single SQL statement, and are semantically equivalent to using `UNION ALL` to join multiple aggregate queries:
| Syntax | Applicable scenario | Output summary combinations |
| --- | --- | --- |
| **ROLLUP** | Aggregate level by level along dimensions with a natural hierarchy, such as time, geography, or category | Aggregate progressively along the specified column order, from the finest granularity up to the grand total |
| **CUBE** | Multiple independent dimensions that need full cross-dimensional analysis | All combinations of all dimension subsets |
| **GROUPING SETS** | Only a few specific dimension combinations are of interest, avoiding the cost of a full CUBE | The grouping sets that the user explicitly specifies |
This article introduces these three syntaxes in the order of "scenario -> syntax -> example," and explains how the companion `GROUPING` and `GROUPING_ID` functions identify subtotal rows and distinguish between two kinds of NULL values.
## ROLLUP: Hierarchical Level-By-Level Summary
<!-- Knowledge type: SQL syntax -->
<!-- Applicable scenario: Dimensions with hierarchical relationships, such as time, geography, or category -->
### Applicable Scenario
`ROLLUP` is suitable for scenarios that aggregate hierarchical dimensions level by level. It aggregates along the specified column order, summarizing data progressively from the finest granularity up to the highest level. For example:
- Time dimension: `ROLLUP(year, month, day)`
- Geographic dimension: `ROLLUP(country, province, city)`
For sales data, you can use `ROLLUP` to aggregate by region and time, producing the monthly sales for each region, the total sales for each region, and the overall total sales.
### Syntax
```sql
SELECT … GROUP BY ROLLUP(grouping_column_reference_list)
```
### Example
The following query analyzes sales by year and month:
```sql
SELECT
YEAR(d_date),
MONTH(d_date),
SUM(ss_net_paid) AS total_sum
FROM
store_sales,
date_dim d1
WHERE
d1.d_date_sk = ss_sold_date_sk
AND YEAR(d_date) IN (2001, 2002)
AND MONTH(d_date) IN (1, 2, 3)
GROUP BY
ROLLUP(YEAR(d_date), MONTH(d_date))
ORDER BY
YEAR(d_date), MONTH(d_date);
```
The query aggregates by time level by level, computing the monthly sales subtotal for each year, the annual sales subtotal for each year, and the overall sales grand total. The query result is as follows:
```sql
+--------------+---------------+-------------+
| YEAR(d_date) | MONTH(d_date) | total_sum |
+--------------+---------------+-------------+
| NULL | NULL | 54262669.17 |
| 2001 | NULL | 26640320.46 |
| 2001 | 1 | 9982165.83 |
| 2001 | 2 | 8454915.34 |
| 2001 | 3 | 8203239.29 |
| 2002 | NULL | 27622348.71 |
| 2002 | 1 | 11260654.35 |
| 2002 | 2 | 7722750.61 |
| 2002 | 3 | 8638943.75 |
+--------------+---------------+-------------+
9 rows in set (0.08 sec)
```
## CUBE: Full Cross-Dimensional Summary
<!-- Knowledge type: SQL syntax -->
<!-- Applicable scenario: Comprehensive multi-dimensional analysis across multiple independent dimensions -->
### Applicable Scenario
`CUBE` is best suited for queries that involve multiple independent dimension columns rather than columns that represent different levels of a single dimension. A common use case is summarizing all combinations of month, region, and product. These are three mutually independent dimensions, and analyzing every possible subtotal combination is very common.
By contrast, cross-tabulating year, month, and day produces many unnecessary values, because the time dimension itself has a natural hierarchy. In most analyses, subtotals such as "profit by month-day" are not needed, and relatively few users ask "what is the total sales on the 16th of every month across the year." Hierarchical dimensions like this are better handled with `ROLLUP`.
### Syntax
```sql
SELECT … GROUP BY CUBE(grouping_column_reference_list)
```
### Example
```sql
SELECT
YEAR(d_date),
i_category,
ca_state,
SUM(ss_net_paid) AS total_sum
FROM
store_sales,
date_dim d1,
item,
customer_address ca
WHERE
d1.d_date_sk = ss_sold_date_sk
AND i_item_sk = ss_item_sk
AND ss_addr_sk = ca_address_sk
AND i_category IN ("Books", "Electronics")
AND YEAR(d_date) IN (1998, 1999)
AND ca_state IN ("LA", "AK")
GROUP BY CUBE(YEAR(d_date), i_category, ca_state)
ORDER BY YEAR(d_date), i_category, ca_state;
```
The query result computes:
- The grand total of sales.
- The sales subtotal for each year, the sales subtotal of items per category, and the sales subtotal per state.
- The sales subtotal per category for each year, the sales subtotal per product for each state, the sales subtotal per state for each year, and the sales subtotal of products per category in each state for each year.
```sql
+--------------+-------------+----------+------------+
| YEAR(d_date) | i_category | ca_state | total_sum |
+--------------+-------------+----------+------------+
| NULL | NULL | NULL | 8690374.60 |
| NULL | NULL | AK | 2675198.33 |
| NULL | NULL | LA | 6015176.27 |
| NULL | Books | NULL | 4238177.69 |
| NULL | Books | AK | 1310791.36 |
| NULL | Books | LA | 2927386.33 |
| NULL | Electronics | NULL | 4452196.91 |
| NULL | Electronics | AK | 1364406.97 |
| NULL | Electronics | LA | 3087789.94 |
| 1998 | NULL | NULL | 4369656.14 |
| 1998 | NULL | AK | 1402539.19 |
| 1998 | NULL | LA | 2967116.95 |
| 1998 | Books | NULL | 2213703.82 |
| 1998 | Books | AK | 719911.29 |
| 1998 | Books | LA | 1493792.53 |
| 1998 | Electronics | NULL | 2155952.32 |
| 1998 | Electronics | AK | 682627.90 |
| 1998 | Electronics | LA | 1473324.42 |
| 1999 | NULL | NULL | 4320718.46 |
| 1999 | NULL | AK | 1272659.14 |
| 1999 | NULL | LA | 3048059.32 |
| 1999 | Books | NULL | 2024473.87 |
| 1999 | Books | AK | 590880.07 |
| 1999 | Books | LA | 1433593.80 |
| 1999 | Electronics | NULL | 2296244.59 |
| 1999 | Electronics | AK | 681779.07 |
| 1999 | Electronics | LA | 1614465.52 |
+--------------+-------------+----------+------------+
27 rows in set (0.21 sec)
```
## GROUPING Function: Identifying Subtotal Rows
<!-- Knowledge type: SQL function -->
<!-- Applicable scenario: Distinguish subtotal rows from real NULLs and filter results by aggregation level -->
When using `ROLLUP` and `CUBE`, the result set raises two issues that need to be addressed:
1. **How to identify subtotal rows**: programmatically detect which result rows represent subtotals and pinpoint the aggregation level a given subtotal corresponds to. This is very common in scenarios such as computing "percentage of grand total."
2. **How to distinguish two kinds of NULL**: when the query result contains both NULL values actually stored in the table and NULL values produced by `ROLLUP` or `CUBE` operations, you need a way to tell them apart.
The `GROUPING` and `GROUPING_ID` functions (paired with `GROUPING SETS`) solve these problems effectively.
### GROUPING
#### Principle
`GROUPING` takes a single column as its argument:
- Returns `1` when the value is a NULL produced by a `ROLLUP` or `CUBE` operation (that is, the row is a subtotal row).
- Returns `0` for any other value, including NULL values that already exist in the table.
Example:
```sql
SELECT
year(d_date),
month(d_date),
sum(ss_net_paid) AS total_sum,
grouping(year(d_date)),
grouping(month(d_date))
FROM
store_sales,
date_dim d1
WHERE
d1.d_date_sk = ss_sold_date_sk
AND year(d_date) IN (2001, 2002)
AND month(d_date) IN (1, 2, 3)
GROUP BY
ROLLUP(year(d_date), month(d_date))
ORDER BY
year(d_date), month(d_date);
```
The values of the `GROUPING` function at different aggregation levels:
- `(year(d_date), month(d_date))` group: result is `(0, 0)`, the detail rows aggregated by year and month.
- `(year(d_date))` group: result is `(0, 1)`, the subtotal rows aggregated by year.
- `()` group: result is `(1, 1)`, the grand total row.
The query result is as follows:
```sql
+--------------+---------------+-------------+------------------------+-------------------------+
| year(d_date) | month(d_date) | total_sum | Grouping(year(d_date)) | Grouping(month(d_date)) |
+--------------+---------------+-------------+------------------------+-------------------------+
| NULL | NULL | 54262669.17 | 1 | 1 |
| 2001 | NULL | 26640320.46 | 0 | 1 |
| 2001 | 1 | 9982165.83 | 0 | 0 |
| 2001 | 2 | 8454915.34 | 0 | 0 |
| 2001 | 3 | 8203239.29 | 0 | 0 |
| 2002 | NULL | 27622348.71 | 0 | 1 |
| 2002 | 1 | 11260654.35 | 0 | 0 |
| 2002 | 2 | 7722750.61 | 0 | 0 |
| 2002 | 3 | 8638943.75 | 0 | 0 |
+--------------+---------------+-------------+------------------------+-------------------------+
9 rows in set (0.06 sec)
```
#### Usage 1: Filtering Aggregation Levels in HAVING
The `GROUPING` function can filter results to a specific aggregation level. The following example keeps only "the grand total of sales," "sales aggregated by year," and "sales aggregated by state":
```sql
SELECT
year(d_date),
i_category,
ca_state,
sum(ss_net_paid) AS total_sum
FROM
store_sales,
date_dim d1,
item,
customer_address ca
WHERE
d1.d_date_sk = ss_sold_date_sk
AND i_item_sk = ss_item_sk
AND ss_addr_sk = ca_address_sk
AND i_category IN ("Books", "Electronics")
AND year(d_date) IN (1998, 1999)
AND ca_state IN ("LA", "AK")
GROUP BY CUBE(year(d_date), i_category, ca_state)
HAVING grouping(year(d_date)) = 1 AND grouping(i_category) = 1 AND grouping(ca_state) = 1
OR grouping(year(d_date)) = 0 AND grouping(i_category) = 1 AND grouping(ca_state) = 1
OR grouping(year(d_date)) = 1 AND grouping(i_category) = 1 AND grouping(ca_state) = 0
ORDER BY year(d_date), i_category, ca_state;
```
The query result is as follows:
```sql
+---------------------+------------+----------+------------+
| year(`d1`.`d_date`) | i_category | ca_state | total_sum |
+---------------------+------------+----------+------------+
| NULL | NULL | NULL | 8690374.60 |
| NULL | NULL | AK | 2675198.33 |
| NULL | NULL | LA | 6015176.27 |
| 1998 | NULL | NULL | 4369656.14 |
| 1999 | NULL | NULL | 4320718.46 |
+---------------------+------------+----------+------------+
5 rows in set (0.13 sec)
```
#### Usage 2: Combining With IF to Improve Readability
Replacing the NULL values in subtotal rows with more intuitive strings makes the result easier to read:
```sql
SELECT
IF(grouping(year(d_date)) = 1, "Multi-year sum", year(d_date)) AS year,
IF(grouping(i_category) = 1, "Multi-category sum", i_category) AS category,
sum(ss_net_paid) AS total_sum
FROM
store_sales,
date_dim d1,
item,
customer_address ca
WHERE
d1.d_date_sk = ss_sold_date_sk
AND i_item_sk = ss_item_sk
AND ss_addr_sk = ca_address_sk
AND i_category IN ("Books", "Electronics")
AND year(d_date) IN (1998, 1999)
AND ca_state IN ("LA", "AK")
GROUP BY CUBE(year(d_date), i_category)
```
The query result is as follows:
```sql
+----------------+--------------------+------------+
| year | category | total_sum |
+----------------+--------------------+------------+
| 1998 | Books | 2213703.82 |
| 1998 | Electronics | 2155952.32 |
| 1999 | Electronics | 2296244.59 |
| 1999 | Books | 2024473.87 |
| 1998 | Multi-category sum | 4369656.14 |
| 1999 | Multi-category sum | 4320718.46 |
| Multi-year sum | Books | 4238177.69 |
| Multi-year sum | Electronics | 4452196.91 |
| Multi-year sum | Multi-category sum | 8690374.60 |
+----------------+--------------------+------------+
9 rows in set (0.09 sec)
```
### GROUPING_ID
#### Applicable Scenario
`GROUPING_ID` and `GROUPING` are both used to assist with multi-dimensional aggregation queries (such as `ROLLUP` and `CUBE`), helping you distinguish between aggregation results at different levels.
To identify the aggregation level a row belongs to with `GROUPING`, you must compute the function on every `GROUP BY` column individually (a single-column result is not enough to distinguish levels), which makes the SQL verbose. `GROUPING_ID` is more powerful than `GROUPING`: it accepts multiple columns as arguments and returns an integer whose binary bits represent the aggregation states of all those columns at once.
When the computed results are stored in a table or materialized view, using the `GROUPING` function to represent different aggregation levels takes up more storage. `GROUPING_ID` is more appropriate in this scenario.
Taking `CUBE(a, b)` as an example, the correspondence between `GROUPING_ID` and `GROUPING` is:
| Aggregation level | Bit Vector | GROUPING_ID | GROUPING(a) | GROUPING(b) |
| ----------- | ---------- | ----------- | ----------- | ----------- |
| a, b | 0 0 | 0 | 0 | 0 |
| a | 0 1 | 1 | 0 | 1 |
| b | 1 0 | 2 | 1 | 0 |
| Grand Total | 1 1 | 3 | 1 | 1 |
#### Syntax and Example
```sql
SELECT
year(d_date),
i_category,
SUM(ss_net_paid) AS total_sum,
GROUPING(year(d_date)),
GROUPING(i_category),
GROUPING_ID(year(d_date), i_category)
FROM
store_sales,
date_dim d1,
item,
customer_address ca
WHERE
d1.d_date_sk = ss_sold_date_sk
AND i_item_sk = ss_item_sk
AND ss_addr_sk = ca_address_sk
AND i_category IN ('Books', 'Electronics')
AND year(d_date) IN (1998, 1999)
AND ca_state IN ('LA', 'AK')
GROUP BY CUBE(year(d_date), i_category);
```
The query result is as follows:
```sql
+--------------+-------------+------------+------------------------+----------------------+---------------------------------------+
| year(d_date) | i_category | total_sum | GROUPING(year(d_date)) | GROUPING(i_category) | GROUPING_ID(year(d_date), i_category) |
+--------------+-------------+------------+------------------------+----------------------+---------------------------------------+
| 1998 | Electronics | 2155952.32 | 0 | 0 | 0 |
| 1998 | Books | 2213703.82 | 0 | 0 | 0 |
| 1999 | Electronics | 2296244.59 | 0 | 0 | 0 |
| 1999 | Books | 2024473.87 | 0 | 0 | 0 |
| 1998 | NULL | 4369656.14 | 0 | 1 | 1 |
| 1999 | NULL | 4320718.46 | 0 | 1 | 1 |
| NULL | Electronics | 4452196.91 | 1 | 0 | 2 |
| NULL | Books | 4238177.69 | 1 | 0 | 2 |
| NULL | NULL | 8690374.60 | 1 | 1 | 3 |
+--------------+-------------+------------+------------------------+----------------------+---------------------------------------+
9 rows in set (0.12 sec)
```
## GROUPING SETS: Specifying Grouping Combinations Precisely
<!-- Knowledge type: SQL syntax -->
<!-- Applicable scenario: Only a few specific dimension combinations are of interest, avoiding the cost of a full CUBE -->
### Applicable Scenario
When you need to aggregate only over a few specified grouping combinations rather than compute a full `CUBE`, use `GROUPING SETS` in the `GROUP BY` clause. It lets you specify exactly which combinations across multiple dimensions to compute, avoiding unnecessary overhead.
Because `CUBE` queries usually consume more resources, `GROUPING SETS` improves execution efficiency when only a few dimensions are of interest.
### Syntax
```sql
SELECT … GROUP BY GROUPING SETS(grouping_column_reference_list)
```
### Example
Suppose you need:
- The sales subtotal of each product category for each year
- The sales subtotal in each state for each year
- The sales subtotal of each product in each state for each year
You can use `GROUPING SETS` to specify these dimension combinations explicitly:
```sql
SELECT
YEAR(d_date),
i_category,
ca_state,
SUM(ss_net_paid) AS total_sum
FROM
store_sales,
date_dim d1,
item,
customer_address ca
WHERE
d1.d_date_sk = ss_sold_date_sk
AND i_item_sk = ss_item_sk
AND ss_addr_sk = ca_address_sk
AND i_category IN ('Books', 'Electronics')
AND YEAR(d_date) IN (1998, 1999)
AND ca_state IN ('LA', 'AK')
GROUP BY GROUPING SETS(
(YEAR(d_date), i_category),
(YEAR(d_date), ca_state),
(YEAR(d_date), ca_state, i_category)
)
ORDER BY YEAR(d_date), i_category, ca_state;
```
The query result is as follows:
```sql
+--------------+-------------+----------+------------+
| YEAR(d_date) | i_category | ca_state | total_sum |
+--------------+-------------+----------+------------+
| 1998 | NULL | AK | 1402539.19 |
| 1998 | NULL | LA | 2967116.95 |
| 1998 | Books | NULL | 2213703.82 |
| 1998 | Books | AK | 719911.29 |
| 1998 | Books | LA | 1493792.53 |
| 1998 | Electronics | NULL | 2155952.32 |
| 1998 | Electronics | AK | 682627.90 |
| 1998 | Electronics | LA | 1473324.42 |
| 1999 | NULL | AK | 1272659.14 |
| 1999 | NULL | LA | 3048059.32 |
| 1999 | Books | NULL | 2024473.87 |
| 1999 | Books | AK | 590880.07 |
| 1999 | Books | LA | 1433593.80 |
| 1999 | Electronics | NULL | 2296244.59 |
| 1999 | Electronics | AK | 681779.07 |
| 1999 | Electronics | LA | 1614465.52 |
+--------------+-------------+----------+------------+
16 rows in set (0.11 sec)
```
The query above is equivalent to running `CUBE` and then using `grouping_id` to keep only the specified aggregation combinations, which avoids unnecessary computation:
```sql
SELECT
SUM(ss_net_paid) AS total_sum,
YEAR(d_date),
i_category,
ca_state
FROM
store_sales,
date_dim d1,
item,
customer_address ca
WHERE
d1.d_date_sk = ss_sold_date_sk
AND i_item_sk = ss_item_sk
AND ss_addr_sk = ca_address_sk
AND i_category IN ('Books', 'Electronics')
AND YEAR(d_date) IN (1998, 1999)
AND ca_state IN ('LA', 'AK')
GROUP BY CUBE(YEAR(d_date), ca_state, i_category)
HAVING grouping_id(YEAR(d_date), ca_state, i_category) = 0
OR grouping_id(YEAR(d_date), ca_state, i_category) = 2
OR grouping_id(YEAR(d_date), ca_state, i_category) = 1;
```
:::info Note
`CUBE` computes every possible aggregation level (eight in this example), but in practice you may only be interested in a few of them.
:::
### Semantic Equivalences
`GROUPING SETS` is the more fundamental form of multi-dimensional aggregation. Both `ROLLUP` and `CUBE` can be expanded into `GROUPING SETS`.
#### GROUPING SETS and GROUP BY UNION ALL
The following `GROUPING SETS` statement:
```sql
SELECT k1, k2, SUM(k3) FROM t GROUP BY GROUPING SETS ((k1, k2), (k1), (k2), ());
```
Is equivalent to multiple `GROUP BY` queries joined by `UNION ALL`:
```sql
SELECT k1, k2, SUM(k3) FROM t GROUP BY k1, k2
UNION ALL
SELECT k1, NULL, SUM(k3) FROM t GROUP BY k1
UNION ALL
SELECT NULL, k2, SUM(k3) FROM t GROUP BY k2
UNION ALL
SELECT NULL, NULL, SUM(k3) FROM t;
```
The query joined by `UNION ALL` is longer and scans the base table multiple times, so it is less efficient both to write and to execute.
#### GROUPING SETS and ROLLUP
`ROLLUP` is an extension of `GROUPING SETS`. For example:
```sql
SELECT a, b, c, SUM(d) FROM tab1 GROUP BY ROLLUP(a, b, c);
```
Is equivalent to the following `GROUPING SETS`:
```sql
GROUPING SETS (
(a, b, c),
(a, b),
(a),
()
);
```
#### GROUPING SETS and CUBE
`CUBE(a, b, c)` is equivalent to the following `GROUPING SETS`:
```sql
GROUPING SETS (
(a, b, c),
(a, b),
(a, c),
(a),
(b, c),
(b),
(c),
()
);
```
## Appendix
For the table creation statements and data files, see the appendix of [Analytic Functions (Window Functions)](./window-function.md).