Broker Load and S3 Load are commonly used methods for importing large volumes of data: users first upload the data to HDFS or S3 and then run a load job to import it in parallel. To speed up the import, Doris uses multiple threads to pull data from HDFS/S3 concurrently. This can put significant pressure on the storage system and destabilize other jobs that depend on it.
You can use the remote I/O limit feature of Workload Groups to cap the bandwidth consumed during import, thereby reducing the impact on other workloads.
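For reference, the kind of job this limit throttles is a standard S3 Load, sketched below with placeholder label, table, bucket, and credential values (exact parameter names may vary across Doris versions):

LOAD LABEL example_db.label_s3_demo
(
    DATA INFILE("s3://bucketname/1kw.tsv")
    INTO TABLE demo_table
    COLUMNS TERMINATED BY "\t"
)
WITH S3
(
    "AWS_ENDPOINT" = "https://s3.example.com",
    "AWS_ACCESS_KEY" = "ak",
    "AWS_SECRET_KEY" = "sk",
    "AWS_REGION" = "us-east-1"
)
PROPERTIES
(
    "timeout" = "3600"
);

The limit applies at the workload group level, so any load or query running under the same group is throttled together.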
Test environment: 1 FE and 1 BE (16 cores, 64 GB of memory); the test data is the ClickBench dataset. Before testing, the dataset needs to be uploaded to S3. To save time, only 10 million rows are uploaded, and the S3 TVF (table-valued function) is then used to query the data directly from S3.
View the schema information after the upload:
DESC FUNCTION s3 (
    "URI" = "https://bucketname/1kw.tsv",
    "s3.access_key" = "ak",
    "s3.secret_key" = "sk",
    "format" = "csv",
    "use_path_style" = "true"
);
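To confirm that the 10 million rows landed in S3, you can count them through the same TVF (the connection parameters are the same placeholders as above):

SELECT count(*) FROM s3(
    "URI" = "https://bucketname/1kw.tsv",
    "s3.access_key" = "ak",
    "s3.secret_key" = "sk",
    "format" = "csv",
    "use_path_style" = "true"
);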
-- Scan only; with dry_run_query enabled, no result set is returned.
set dry_run_query = true;
SELECT * FROM s3(
    "URI" = "https://bucketname/1kw.tsv",
    "s3.access_key" = "ak",
    "s3.secret_key" = "sk",
    "format" = "csv",
    "use_path_style" = "true"
);
While the scan is running, monitor the remote read throughput from another session. Without a limit, it holds at roughly 850 MB/s:

MySQL [(none)]> select cast(REMOTE_SCAN_BYTES_PER_SECOND/1024/1024 as int) as read_mb from information_schema.workload_group_resource_usage;
+---------+
| read_mb |
+---------+
|     837 |
+---------+
1 row in set (0.104 sec)

MySQL [(none)]> select cast(REMOTE_SCAN_BYTES_PER_SECOND/1024/1024 as int) as read_mb from information_schema.workload_group_resource_usage;
+---------+
| read_mb |
+---------+
|     867 |
+---------+
1 row in set (0.070 sec)

MySQL [(none)]> select cast(REMOTE_SCAN_BYTES_PER_SECOND/1024/1024 as int) as read_mb from information_schema.workload_group_resource_usage;
+---------+
| read_mb |
+---------+
|     867 |
+---------+
1 row in set (0.186 sec)
Now limit the remote read bandwidth of the normal workload group to 100 MB/s (104857600 bytes per second):

alter workload group normal properties('remote_read_bytes_per_second'='104857600');
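To check that the new value took effect, you can list the group's properties; this assumes your Doris version supports the SHOW WORKLOAD GROUPS statement and includes this property in its output:

show workload groups;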
-- Scan again with the limit in place; dry_run_query still suppresses the result set.
set dry_run_query = true;
SELECT * FROM s3(
    "URI" = "https://bucketname/1kw.tsv",
    "s3.access_key" = "ak",
    "s3.secret_key" = "sk",
    "format" = "csv",
    "use_path_style" = "true"
);
Monitoring again shows that the read throughput now fluctuates around the configured 100 MB/s limit:

MySQL [(none)]> select cast(REMOTE_SCAN_BYTES_PER_SECOND/1024/1024 as int) as read_mb from information_schema.workload_group_resource_usage;
+---------+
| read_mb |
+---------+
|      56 |
+---------+
1 row in set (0.010 sec)

MySQL [(none)]> select cast(REMOTE_SCAN_BYTES_PER_SECOND/1024/1024 as int) as read_mb from information_schema.workload_group_resource_usage;
+---------+
| read_mb |
+---------+
|     131 |
+---------+
1 row in set (0.009 sec)

MySQL [(none)]> select cast(REMOTE_SCAN_BYTES_PER_SECOND/1024/1024 as int) as read_mb from information_schema.workload_group_resource_usage;
+---------+
| read_mb |
+---------+
|     111 |
+---------+
1 row in set (0.009 sec)

The instantaneous readings vary, but the average stays close to the 100 MB/s cap.
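When the test is finished, the cap can be removed. This sketch assumes that setting the property to -1 restores unlimited bandwidth, which is its documented default:

-- Assumption: -1 means no remote read limit for the group.
alter workload group normal properties('remote_read_bytes_per_second'='-1');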