A Dictionary is a special data structure provided by Doris to accelerate JOIN operations. It is built on top of a regular table, treats the corresponding columns of the source table as a key-value relationship, and preloads the full data of these columns into memory to enable fast lookup operations, thereby improving query performance.
Dictionaries are especially suitable for scenarios that require frequent key-value lookups. As a key-value lookup solution, dictionaries do not allow duplicate keys.
:::tip This feature is experimental and is supported starting from version 4.1.0. :::
Dictionaries are mainly suitable for the following scenarios:
Key-value lookups that originally had to be implemented with a LEFT OUTER JOIN can completely eliminate the JOIN overhead with the help of a dictionary, turning into a regular function call.
In an e-commerce system, the order table (orders, the fact table) records a large amount of transaction data and frequently needs to be joined with the product table (products, the dimension table) to obtain detailed product information.
Step 1: Create the base fact table and dimension table, and insert sample data
-- Product dimension table CREATE TABLE products ( product_id BIGINT NOT NULL COMMENT "Product ID", product_name VARCHAR(128) NOT NULL COMMENT "Product name", brand_name VARCHAR(64) NOT NULL COMMENT "Brand name", category_name VARCHAR(64) NOT NULL COMMENT "Category name", retail_price DECIMAL(10,2) NOT NULL COMMENT "Retail price", update_time DATETIME NOT NULL COMMENT "Update time" ) DISTRIBUTED BY HASH(`product_id`) BUCKETS 10; -- Order fact table CREATE TABLE orders ( order_id BIGINT NOT NULL COMMENT "Order ID", product_id BIGINT NOT NULL COMMENT "Product ID", user_id BIGINT NOT NULL COMMENT "User ID", quantity INT NOT NULL COMMENT "Purchase quantity", actual_price DECIMAL(10,2) NOT NULL COMMENT "Actual transaction price", order_time DATETIME NOT NULL COMMENT "Order time" ) DISTRIBUTED BY HASH(`order_id`) BUCKETS 32; -- Insert sample data INSERT INTO products VALUES (1001, 'iPhone 15 Pro 256G Black', 'Apple', 'Mobile & Digital', 8999.00, '2024-01-01 00:00:00'), (1002, 'MacBook Pro M3 Max', 'Apple', 'Computers & Office', 19999.00, '2024-01-01 00:00:00'), (1003, 'AirPods Pro 2', 'Apple', 'Phone Accessories', 1999.00, '2024-01-01 00:00:00'); INSERT INTO orders VALUES (10001, 1001, 88001, 1, 8899.00, '2024-02-22 10:15:00'), (10002, 1002, 88002, 1, 19599.00, '2024-02-22 11:30:00'), (10003, 1003, 88001, 2, 1899.00, '2024-02-22 14:20:00');
Step 2: Traditional JOIN approach
To count the number of orders and total sales for each category, you previously had to use a LEFT OUTER JOIN to retrieve product information from the product table:
-- Count the number of orders and total sales for each category SELECT p.category_name, p.brand_name, COUNT(DISTINCT o.order_id) as order_count, SUM(o.quantity) as total_quantity, SUM(o.actual_price * o.quantity) as total_amount FROM orders o LEFT JOIN products p ON o.product_id = p.product_id WHERE o.order_time >= '2024-02-22 00:00:00' GROUP BY p.category_name, p.brand_name ORDER BY total_amount DESC;
+---------------------+------------+-------------+----------------+--------------+ | category_name | brand_name | order_count | total_quantity | total_amount | +---------------------+------------+-------------+----------------+--------------+ | Computers & Office | Apple | 1 | 1 | 19599.00 | | Mobile & Digital | Apple | 1 | 1 | 8899.00 | | Phone Accessories | Apple | 1 | 2 | 3798.00 | +---------------------+------------+-------------+----------------+--------------+
In this kind of query, you frequently need to look up other product information by product_id, which is essentially a KV lookup operation.
Step 3: Use a dictionary instead of JOIN
Once the key-value relationship is defined, you can build the corresponding dictionary in advance to fully convert the previous JOIN operation into a more lightweight key-value lookup, which improves SQL execution efficiency:
-- Create the product information dictionary CREATE DICTIONARY product_info_dict USING products ( product_id KEY, product_name VALUE, brand_name VALUE, category_name VALUE, retail_price VALUE ) LAYOUT(HASH_MAP) PROPERTIES( 'data_lifetime'='300' -- Considering how often product information changes, refresh every 5 minutes );
With the dictionary, the JOIN operation in the original query is converted into a dict_get function lookup, which is a lightweight KV lookup:
SELECT dict_get("test.product_info_dict", "category_name", o.product_id) as category_name, dict_get("test.product_info_dict", "brand_name", o.product_id) as brand_name, COUNT(DISTINCT o.order_id) as order_count, SUM(o.quantity) as total_quantity, SUM(o.actual_price * o.quantity) as total_amount FROM orders o WHERE o.order_time >= '2024-02-22 00:00:00' GROUP BY dict_get("test.product_info_dict", "category_name", o.product_id), dict_get("test.product_info_dict", "brand_name", o.product_id) ORDER BY total_amount DESC;
+---------------------+------------+-------------+----------------+--------------+ | category_name | brand_name | order_count | total_quantity | total_amount | +---------------------+------------+-------------+----------------+--------------+ | Computers & Office | Apple | 1 | 1 | 19599.00 | | Mobile & Digital | Apple | 1 | 1 | 8899.00 | | Phone Accessories | Apple | 1 | 2 | 3798.00 | +---------------------+------------+-------------+----------------+--------------+
CREATE DICTIONARY <dict_name> USING <source_table> ( <key_column> KEY[, ..., <key_columns> VALUE] <value_column> VALUE[, ..., <value_columns> VALUE] ) LAYOUT(<layout_type>) PROPERTIES( "<priority_item_key>" = "<priority_item_value>"[, ..., "<priority_item_key>" = "<priority_item_value>"] );
Parameter descriptions:
| Parameter | Meaning |
|---|---|
<dict_name> | The name of the dictionary |
<source_table> | The source data table |
<key_column> | The column name in the source table that serves as a key |
<value_column> | The column name in the source table that serves as a value |
<layout_type> | The storage layout type of the dictionary, see below for details |
<priority_item_key> | The name of a property of the dictionary |
<priority_item_value> | The value of a property of the dictionary |
There must be at least one <key_column> and one <value_column>. The <key_column> does not have to appear before the <value_column>.
Two layout types are currently supported:
| Layout Type | Use Case | Description |
|---|---|---|
HASH_MAP | General key-value lookup scenarios | Hash-table-based implementation |
IP_TRIE | IP address lookups | Trie-based implementation, optimized specifically for IP address lookups. The key column must be IP addresses in CIDR notation, and queries are matched against the CIDR notation |
| Property | Value Type | Meaning | Required |
|---|---|---|---|
data_lifetime | Integer, in seconds | Data validity period. When the time since the last update of the dictionary exceeds this value and the base table has data changes, a re-import is automatically triggered. See Auto Import for the import logic | Yes |
skip_null_key | Boolean | When loading data into the dictionary, if a null value appears in the key column, skip that row when this property is true; otherwise, an error is reported. The default value is false | No |
memory_limit | Integer, in bytes | The upper limit of memory used by the dictionary on a single BE. The default value is 2147483648, that is, 2GB | No |
-- Create the source data table CREATE TABLE source_table ( id INT NOT NULL, city VARCHAR(32) NOT NULL, code VARCHAR(32) NOT NULL ) ENGINE=OLAP DISTRIBUTED BY HASH(id) BUCKETS 1; -- Create the dictionary CREATE DICTIONARY city_dict USING source_table ( city KEY, id VALUE ) LAYOUT(HASH_MAP) PROPERTIES('data_lifetime' = '600');
Based on this table, you can use the city_dict dictionary together with the dict_get function to look up the corresponding id by the city value of source_table.
1. Key column restrictions
2. Null value handling
skip_null_key in the Properties.Dictionaries support both automatic and manual loading. Loading a dictionary is also referred to as a “refresh” operation.
Automatic loading happens at the following times:
Doris checks whether all dictionary data has expired every dictionary_auto_refresh_interval_seconds seconds. When a dictionary has not been updated for more than data_lifetime seconds and the base table data has changed since the last load, Doris automatically submits a load for that dictionary.
If some BEs are missing data and the base table data has not changed since the last load, Doris only fills in the current version of the data on the affected BEs. It does not submit a refresh task for all BEs, and the dictionary version does not change.
Purpose: Manually trigger a refresh of dictionary data.
Command:
REFRESH DICTIONARY <dict_name>;
Description: <dict_name> is the name of the dictionary whose data is to be loaded.
LOADING), a manual load fails. Wait for the in-progress load to finish before trying again.memory_limit, the load transaction fails.You can use the dict_get and dict_get_many functions to query a dictionary with a single key/value column or with multiple key/value columns, respectively.
For the first query, wait until the dictionary load completes.
dict_get("<db_name>.<dict_name>", "<query_column>", <query_key_value>); dict_get_many("<db_name>.<dict_name>", <query_columns>, <query_key_values>);
Parameter descriptions:
| Parameter | Description |
|---|---|
<db_name> | The name of the database that the dictionary belongs to |
<dict_name> | The name of the dictionary |
<query_column> | The name of the value column to query, of type VARCHAR, must be a constant |
<query_columns> | The names of all value columns to query, of type ARRAY<VARCHAR>, must be a constant |
<query_key_value> | The key column data used for the query |
<query_key_values> | A STRUCT containing the data to query for all key columns of the dictionary |
Return types:
dict_get is the type of the dictionary column corresponding to <query_column>.dict_get_many is a STRUCT composed of the types of the dictionary columns corresponding to <query_columns>.Example 1: Single key, single value query
Query the city_dict dictionary in the test_db database to retrieve the id value when the key column value is “Beijing”:
SELECT dict_get("test_db.city_dict", "id", "Beijing");
Example 2: Single key, multiple value query
Query the single_key_dict dictionary in the test_db database to retrieve the k1 and k3 column values when the key column value is 1:
SELECT dict_get_many("test_db.single_key_dict", ["k1", "k3"], struct(1));
Example 3: Multiple key, multiple value query
Query the multi_key_dict dictionary in the test_db database to retrieve the k2 and k3 column values when the two key column values are 2 and ‘ABC’ respectively:
SELECT dict_get_many("test_db.multi_key_dict", ["k2", "k3"], struct(2, 'ABC'));
For example, if the table creation statement is:
create table if not exists multi_key_table( k0 int not null, k1 varchar not null, k2 float not null, k3 varchar not null ) DISTRIBUTED BY HASH(`k0`) BUCKETS auto; create dictionary multi_key_dict using multi_key_table ( k0 KEY, k1 KEY, k2 VALUE, k3 VALUE ) LAYOUT(HASH_MAP) PROPERTIES('data_lifetime' = '600');
Then the return type of the previous statement
SELECT dict_get_many("test_db.multi_key_dict", ["k2", "k3"], struct(2, 'ABC'));
is STRUCT<float, varchar>.
<query_key_value> must be IPV4 or IPV6.<key_column> and the <query_key_value> used for the query support both IPV4 and IPV6 formats.dictionary_auto_refresh_interval_seconds can shorten the period during which the dictionary is unavailable.Dictionaries support the following management and inspection statements:
1. View the status of all dictionaries in the current database
SHOW DICTIONARIES [LIKE <LIKE_NAME>];
2. View the definition of a specific dictionary
DESC DICTIONARY <dict_name>;
3. Drop a dictionary
DROP DICTIONARY <dict_name>;
After a dictionary is dropped, it may not be removed from the BEs immediately.
Dictionaries support the following configuration items, all of which are FE CONFIG:
| Configuration Item | Default Value | Dynamically Adjustable | Description |
|---|---|---|---|
dictionary_task_queue_size | 1024 | No | The queue length of the thread pool for all dictionary tasks. Usually does not need to be adjusted |
job_dictionary_task_consumer_thread_num | 3 | No | The number of threads in the thread pool for all dictionary tasks |
dictionary_rpc_timeout_ms | 5000 (5s) | Yes | The timeout for all dictionary-related RPCs. Usually does not need to be adjusted |
dictionary_auto_refresh_interval_seconds | 5 (seconds) | Yes | The interval for automatically checking whether all dictionary data has expired |
The SHOW DICTIONARIES statement shows the base table corresponding to each dictionary, the current data version number, and the corresponding status on the FE and BE.
> SHOW DICTIONARIES;
+--------------+----------------+----------------------------------------------+---------+--------+------------------------------------+------------------------------+ | DictionaryId | DictionaryName | BaseTableName | Version | Status | DataDistribution | LastUpdateResult | +--------------+----------------+----------------------------------------------+---------+--------+------------------------------------+------------------------------+ | 51 | precision_dict | internal.test_refresh_dict.precision_test | 2 | NORMAL | {10.16.10.2:9767 ver=2 memory=368} | 2025-02-18 09:58:12: succeed | | 48 | product_dict | internal.test_refresh_dict.product_info | 2 | NORMAL | {10.16.10.2:9767 ver=2 memory=240} | 2025-02-18 09:58:12: succeed | | 49 | ip_dict | internal.test_refresh_dict.ip_info | 2 | NORMAL | {10.16.10.2:9767 ver=2 memory=194} | 2025-02-18 09:58:12: succeed | | 52 | order_dict | internal.test_refresh_dict.column_order_test | 2 | NORMAL | {10.16.10.2:9767 ver=2 memory=432} | 2025-02-18 09:58:12: succeed | | 50 | user_dict | internal.test_refresh_dict.user_info | 2 | NORMAL | {10.16.10.2:9767 ver=2 memory=240} | 2025-02-18 09:58:12: succeed | +--------------+----------------+----------------------------------------------+---------+--------+------------------------------------+------------------------------+
Field meanings:
Version: The data version number, which is incremented by 1 each time data is loaded.
Status: The dictionary status, with the following meanings:
| Status Name | Meaning |
|---|---|
| NORMAL | The dictionary is currently normal |
| LOADING | The dictionary is currently loading data |
| OUT_OF_DATE | The dictionary data has expired |
A dictionary cannot be loaded again while a load is already in progress.
DataDistribution: The current state on each BE, including the version number and memory usage (KB).
LastUpdateResult: The result of the previous load (either automatic or manual). If there is an error, detailed information is shown here.
To view the column definitions of a dictionary, use DESC DICTIONARY. For example:
> DESC DICTIONARY city_code_dict; +-------------+-------------+------+-------+ | Field | Type | Null | Key | +-------------+-------------+------+-------+ | city_name | varchar(32) | NO | true | | region_code | varchar(32) | NO | false | +-------------+-------------+------+-------+
data_lifetime of the dictionary appropriately to allow for automatic updates, and supplement with manual updates based on business logic when necessary.Choose key and value columns reasonably:
Layout selection:
State management:
-- Create the source data table CREATE TABLE cities ( city_id INT NOT NULL, city_name VARCHAR(32) NOT NULL, region_code VARCHAR(32) NOT NULL ) ENGINE=OLAP DISTRIBUTED BY HASH(city_id) BUCKETS 1; -- Insert data INSERT INTO cities VALUES (1, 'Beijing', 'BJ'), (2, 'Shanghai', 'SH'), (3, 'Guangzhou', 'GZ'); -- Create the dictionary CREATE DICTIONARY city_code_dict USING cities ( city_name KEY, region_code VALUE ) LAYOUT(HASH_MAP) PROPERTIES('data_lifetime' = '600'); -- Query using the dictionary SELECT dict_get("test_refresh_dict.city_code_dict", "region_code", "Beijing");
+------------------------------------------------------------------------+ | dict_get('test_refresh_dict.city_code_dict', 'region_code', 'Beijing') | +------------------------------------------------------------------------+ | BJ | +------------------------------------------------------------------------+
-- Create the source data table CREATE TABLE ip_locations ( ip_range VARCHAR(30) NOT NULL, country VARCHAR(64) NOT NULL, region VARCHAR(64) NOT NULL, city VARCHAR(64) NOT NULL ) ENGINE=OLAP DISTRIBUTED BY HASH(ip_range) BUCKETS 1; -- Insert some sample data INSERT INTO ip_locations VALUES ('1.0.0.0/24', 'United States', 'California', 'Los Angeles'), ('1.0.1.0/24', 'China', 'Beijing', 'Beijing'), ('1.0.4.0/24', 'Japan', 'Tokyo', 'Tokyo'); -- Create the IP address dictionary CREATE DICTIONARY ip_location_dict USING ip_locations ( ip_range KEY, country VALUE, region VALUE, city VALUE ) LAYOUT(IP_TRIE) PROPERTIES('data_lifetime' = '600'); -- Query the location information for an IP address, matched by CIDR. SELECT dict_get("test_refresh_dict.ip_location_dict", "country", cast('1.0.0.1' as ipv4)) AS country, dict_get("test_refresh_dict.ip_location_dict", "region", cast('1.0.0.2' as ipv4)) AS region, dict_get("test_refresh_dict.ip_location_dict", "city", cast('1.0.0.3' as ipv4)) AS city;
+---------------+------------+-------------+ | country | region | city | +---------------+------------+-------------+ | United States | California | Los Angeles | +---------------+------------+-------------+
-- Product SKU dimension table: contains the basic attributes of products CREATE TABLE product_sku_info ( product_id INT NOT NULL COMMENT "Product ID", color_code VARCHAR(32) NOT NULL COMMENT "Color code", size_code VARCHAR(32) NOT NULL COMMENT "Size code", product_name VARCHAR(128) NOT NULL COMMENT "Product name", color_name VARCHAR(32) NOT NULL COMMENT "Color name", size_name VARCHAR(32) NOT NULL COMMENT "Size name", stock INT NOT NULL COMMENT "Stock", price DECIMAL(10,2) NOT NULL COMMENT "Price", update_time DATETIME NOT NULL COMMENT "Update time" ) DISTRIBUTED BY HASH(`product_id`) BUCKETS 10; -- Order detail table: records actual sales data CREATE TABLE order_details ( order_id BIGINT NOT NULL COMMENT "Order ID", product_id INT NOT NULL COMMENT "Product ID", color_code VARCHAR(32) NOT NULL COMMENT "Color code", size_code VARCHAR(32) NOT NULL COMMENT "Size code", quantity INT NOT NULL COMMENT "Purchase quantity", order_time DATETIME NOT NULL COMMENT "Order time" ) DISTRIBUTED BY HASH(`order_id`) BUCKETS 10; -- Insert product SKU data INSERT INTO product_sku_info VALUES (1001, 'BLK', 'M', 'Nike Sports T-Shirt', 'Black', 'Size M', 100, 199.00, '2024-02-23 10:00:00'), (1001, 'BLK', 'L', 'Nike Sports T-Shirt', 'Black', 'Size L', 80, 199.00, '2024-02-23 10:00:00'), (1001, 'WHT', 'M', 'Nike Sports T-Shirt', 'White', 'Size M', 90, 199.00, '2024-02-23 10:00:00'), (1001, 'WHT', 'L', 'Nike Sports T-Shirt', 'White', 'Size L', 70, 199.00, '2024-02-23 10:00:00'), (1002, 'RED', 'S', 'Adidas Sports Pants', 'Red', 'Size S', 50, 299.00, '2024-02-23 10:00:00'), (1002, 'RED', 'M', 'Adidas Sports Pants', 'Red', 'Size M', 60, 299.00, '2024-02-23 10:00:00'), (1002, 'BLU', 'S', 'Adidas Sports Pants', 'Blue', 'Size S', 55, 299.00, '2024-02-23 10:00:00'), (1002, 'BLU', 'M', 'Adidas Sports Pants', 'Blue', 'Size M', 65, 299.00, '2024-02-23 10:00:00'); -- Insert order data INSERT INTO order_details VALUES (10001, 1001, 'BLK', 'M', 2, '2024-02-23 12:01:00'), (10002, 1001, 'WHT', 'L', 1, '2024-02-23 12:05:00'), (10003, 1002, 'RED', 'S', 1, '2024-02-23 12:10:00'), (10004, 1001, 'BLK', 'L', 3, '2024-02-23 12:15:00'), (10005, 1002, 'BLU', 'M', 2, '2024-02-23 12:20:00'); -- Create a multi-key, multi-value dictionary CREATE DICTIONARY sku_dict USING product_sku_info ( product_id KEY, color_code KEY, size_code KEY, product_name VALUE, color_name VALUE, size_name VALUE, price VALUE, stock VALUE ) LAYOUT(HASH_MAP) PROPERTIES('data_lifetime'='300'); -- Example of a query using dict_get_many: get order details and SKU information WITH order_sku_info AS ( SELECT o.order_id, o.quantity, o.order_time, dict_get_many("test.sku_dict", ["product_name", "color_name", "size_name", "price", "stock"], struct(o.product_id, o.color_code, o.size_code) ) as sku_info FROM order_details o WHERE o.order_time >= '2024-02-23 12:00:00' AND o.order_time < '2024-02-23 13:00:00' ) SELECT order_id, order_time, struct_element(sku_info, 'product_name') as product_name, struct_element(sku_info, 'color_name') as color_name, struct_element(sku_info, 'size_name') as size_name, quantity, struct_element(sku_info, 'price') as unit_price, quantity * struct_element(sku_info, 'price') as total_amount, struct_element(sku_info, 'stock') as current_stock FROM order_sku_info ORDER BY order_time;
+----------+---------------------+---------------------+------------+-----------+----------+------------+--------------+---------------+ | order_id | order_time | product_name | color_name | size_name | quantity | unit_price | total_amount | current_stock | +----------+---------------------+---------------------+------------+-----------+----------+------------+--------------+---------------+ | 10001 | 2024-02-23 12:01:00 | Nike Sports T-Shirt | Black | Size M | 2 | 199.00 | 398.00 | 100 | | 10002 | 2024-02-23 12:05:00 | Nike Sports T-Shirt | White | Size L | 1 | 199.00 | 199.00 | 70 | | 10003 | 2024-02-23 12:10:00 | Adidas Sports Pants | Red | Size S | 1 | 299.00 | 299.00 | 50 | | 10004 | 2024-02-23 12:15:00 | Nike Sports T-Shirt | Black | Size L | 3 | 199.00 | 597.00 | 80 | | 10005 | 2024-02-23 12:20:00 | Adidas Sports Pants | Blue | Size M | 2 | 299.00 | 598.00 | 65 | +----------+---------------------+---------------------+------------+-----------+----------+------------+--------------+---------------+
| Error Symptom | Solution |
|---|---|
Query reports can not find dict name | First, run SHOW DICTIONARIES to confirm whether the dictionary exists. If it does, refresh the corresponding dictionary data again |
Query reports dict_get() only support IP type for IP_TRIE | Check whether the key column of the IP_TRIE dictionary strictly follows the CIDR format |
Load reports Version ID is not greater than the existing version ID for the dictionary. | Use the DROP DICTIONARY command to drop the corresponding dictionary, then re-create it and reload the data |
SHOW DICTIONARIES shows that the Version of the dictionary on a certain BE is greater than the FE Version | Use the DROP DICTIONARY command to drop the corresponding dictionary, then re-create it and reload the data |
Load reports Dictionary X commit version Y failed | Reload the dictionary |
Fallback strategy: For the vast majority of errors, if normal operations fail, dropping the dictionary and re-creating it can resolve the issue.
Q1: What is the difference between a dictionary and a regular materialized view?
A dictionary is a pure in-memory KV structure, designed specifically to accelerate key-value lookup operations and to convert the original JOIN operation into a dict_get function call. A materialized view, on the other hand, is aimed at more general precomputation scenarios.
Q2: Is the data of a dictionary kept strongly consistent with the base table?
No. Doris does not maintain strong data consistency between a dictionary and its base table. Data needs to be synchronized through automatic updates via data_lifetime or manual REFRESH DICTIONARY.
Q3: When should you choose IP_TRIE over HASH_MAP?
Use IP_TRIE when you need to perform IP range matching queries based on CIDR. For all other key-value matching scenarios, use HASH_MAP.
Q4: What should you do if a dictionary uses too much memory?
You can use the memory_limit property to limit the memory upper bound on a single BE. It is also recommended to choose columns with moderate cardinality as the source for the dictionary, to avoid the dictionary becoming too large.
Q5: What are the possible reasons a dictionary query returns null?
When the queried key does not exist in the dictionary, or when the queried key data is null, null is returned.