In Doris, the core of data distribution is to efficiently map the rows of data written to the table onto various data shards (Tablets) in the underlying storage through reasonable partitioning and bucketing strategies. Through data distribution strategies, Doris can fully utilize the storage and computing capabilities of multiple nodes, thereby supporting efficient storage and querying of large-scale data.
When writing data, Doris first allocates the rows of data to the corresponding partitions based on the table's partitioning strategy. Then, according to the bucketing strategy, the rows of data are further mapped to specific shards within the partition, determining the storage location of the data rows.
During query execution, Doris's optimizer will trim data based on partitioning and bucketing strategies to maximize the reduction of the scanning range. In cases involving JOIN or aggregation queries, data transfer across nodes (Shuffle) may occur. Reasonable partitioning and bucketing design can reduce Shuffle and fully utilize Colocate Join to optimize query performance.
A Doris cluster consists of the following two types of nodes:
The data stored in the BE node is divided into shards, with each shard being the smallest unit of data management in Doris and the basic unit for data movement and replication.
Partitioning is the first layer of logical division for data organization, used to divide the data in the table into smaller subsets. Doris provides the following two partition types and three partition modes:
ALTER statements).Bucketing is the second layer of logical division for data organization, used to further divide data rows into smaller units within a partition. Doris supports the following two bucketing methods:
crc32 hash value of the bucketing column and taking the modulus of the number of buckets.load_to_single_tablet option can be used to optimize the quick writing of small-scale data.For large tables that frequently require JOIN or aggregation queries, the Colocate strategy can be enabled to place data with the same bucketing column values on the same physical node, reducing data transfer across nodes and significantly improving query performance.
During queries, Doris can prune irrelevant partitions through filtering conditions, thereby reducing the data scanning range and lowering I/O costs.
During queries, a reasonable number of buckets can fully utilize the computational and I/O resources of the machines.
Uniform Data Distribution Ensure that data is evenly distributed across all BE nodes to avoid data skew that could overload certain nodes, thereby improving overall system performance.
Optimize Query Performance Reasonable partition pruning can significantly reduce the amount of data scanned, a reasonable number of buckets can enhance computational parallelism, and effective use of COLOCATE can lower Shuffle costs, improving JOIN and aggregation query efficiency.
Flexible Data Management
Control Metadata Scale The metadata for each shard is stored in both FE and BE, so it is necessary to reasonably control the number of shards. The empirical recommendation is:
Optimize Write Throughput
By carefully designing and managing partitioning and bucketing strategies, Doris can efficiently support the storage and query processing of large-scale data, meeting various complex business needs.