If partitions are used, the DISTRIBUTED BY clause describes the rules for dividing data within each partition.
If partitions are not used, it describes the rules for dividing the data across the entire table.
It is also possible to specify a bucketing method for each partition individually.
Multiple bucket columns can be specified. For the Aggregate and Unique models, bucket columns must be Key columns; for the Duplicate model, they can be either key or value columns. Bucket columns can be the same as or different from partition columns.
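As an illustration, here is a hedged sketch of both cases, using a hypothetical Aggregate-model table (all table and column names are illustrative). The bucket column user_id is a key column, as the Aggregate model requires:

```sql
-- Partitioned table: DISTRIBUTED BY divides data *within* each partition.
CREATE TABLE site_visits (
    event_date DATE,
    user_id    BIGINT,
    pv         BIGINT SUM DEFAULT "0"
)
AGGREGATE KEY (event_date, user_id)
PARTITION BY RANGE (event_date) (
    PARTITION p202401 VALUES LESS THAN ("2024-02-01"),
    PARTITION p202402 VALUES LESS THAN ("2024-03-01")
)
DISTRIBUTED BY HASH (user_id) BUCKETS 16
PROPERTIES ("replication_num" = "3");

-- Unpartitioned table: the same clause divides the whole table.
CREATE TABLE site_visits_flat (
    event_date DATE,
    user_id    BIGINT,
    pv         BIGINT SUM DEFAULT "0"
)
AGGREGATE KEY (event_date, user_id)
DISTRIBUTED BY HASH (user_id) BUCKETS 16;
```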
The choice of bucket columns involves a trade-off between query throughput and query concurrency:

- If multiple bucket columns are chosen, data is distributed more evenly. However, a query that does not carry equality conditions on all bucket columns must scan every bucket; this raises the throughput of a single query but consumes more resources per query, which suits high-throughput, low-concurrency scenarios.
- If only one or a few bucket columns are chosen, a point query with equality conditions on them can be pruned to a single bucket, which suits high-concurrency point-query scenarios.
When adding a partition with ADD PARTITION, the bucket number for the new partition can be specified separately (see the sketch after the list below). This feature is convenient for handling data expansion or reduction. Here are some examples, assuming there are 10 BEs, each with one disk:

- Total table size 500MB: 4-8 tablets.
- 5GB: 8-16 tablets.
- 50GB: 32 tablets.
- 500GB: It is recommended to partition the table, with each partition around 50GB in size and 16-32 tablets per partition.
- 5TB: It is recommended to partition the table, with each partition around 50GB in size and 16-32 tablets per partition.
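For the per-partition bucket number mentioned above, a minimal sketch against the hypothetical site_visits table from the earlier example:

```sql
-- The data volume grew, so the new monthly partition gets more
-- buckets than the table default of 16. The hash columns must
-- match the table's distribution columns; only BUCKETS may differ.
ALTER TABLE site_visits
ADD PARTITION p202403 VALUES LESS THAN ("2024-04-01")
DISTRIBUTED BY HASH (user_id) BUCKETS 32;
```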
The data volume of a table can be viewed using the SHOW DATA command, and the result should be divided by the number of replicas to obtain the actual data volume of the table.
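A short sketch of this check (the table name is illustrative):

```sql
-- Inspect the stored size of the table; the reported size
-- includes all replicas.
SHOW DATA FROM site_visits;

-- If SHOW DATA reports e.g. 150GB with replication_num = 3,
-- the actual single-copy data volume is 150GB / 3 = 50GB.
```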
When Random Distribution is used, single-tablet import can be enabled during data loading (by setting the session variable load_to_single_tablet to true). Then, during large-volume data import, a task will write to only one tablet when writing data to the corresponding partition. This improves the concurrency and throughput of data import, reduces the write amplification caused by import and compaction, and helps keep the cluster stable.
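A minimal sketch, assuming a hypothetical log table with no natural bucket column:

```sql
-- Table without a natural bucket column: use random distribution.
-- Random distribution requires the Duplicate model.
CREATE TABLE access_logs (
    log_time DATETIME,
    message  VARCHAR(1024)
)
DUPLICATE KEY (log_time)
DISTRIBUTED BY RANDOM BUCKETS 16;

-- Enable single-tablet import for the current session before a large
-- load, so each load task writes to only one tablet per partition.
SET load_to_single_tablet = true;
```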