Apache Doris supports importing files from Google Cloud Storage (GCS) by accessing GCS buckets through the S3-compatible protocol. This document describes two import methods and their complete operating procedures.
Doris provides two methods for importing files from Google Cloud Storage. Choose one based on data volume and timeliness requirements:
| Import Method | Execution Mode | Applicable Scenario | Reference |
|---|---|---|---|
| S3 Load | Asynchronous | Large-batch data imports, long-running jobs | Broker Load Manual |
| S3 TVF (table function) | Synchronous | Small-batch data imports, ad-hoc queries, quick validation | This document |
Recommendations:
INSERT INTO ... SELECT, use S3 TVF.Prepare the following information before importing:
your_bucket_name).storage.us-west2.rep.googleapis.com and US-WEST2).S3 Load is suitable for asynchronous imports of large data volumes. For detailed parameters and advanced usage, see the Broker Load Manual.
Create a CSV file s3load_example.csv and upload it to Google Cloud Storage. The file contents are:
1,Emily,25 2,Benjamin,35 3,Olivia,28 4,Alexander,60 5,Ava,17 6,William,69 7,Sophia,32 8,James,64 9,Emma,37 10,Liam,64
CREATE TABLE test_s3load( user_id BIGINT NOT NULL COMMENT "user id", name VARCHAR(20) COMMENT "name", age INT COMMENT "age" ) DUPLICATE KEY(user_id) DISTRIBUTED BY HASH(user_id) BUCKETS 10;
LOAD LABEL s3_load_2022_04_01 ( DATA INFILE("s3://your_bucket_name/s3load_example.csv") INTO TABLE test_s3load COLUMNS TERMINATED BY "," FORMAT AS "CSV" (user_id, name, age) ) WITH S3 ( "provider" = "GCP", "s3.endpoint" = "storage.us-west2.rep.googleapis.com", "s3.region" = "US-WEST2", "s3.access_key" = "<your-ak>", "s3.secret_key" = "<your-sk>" ) PROPERTIES ( "timeout" = "3600" );
Key parameters:
| Parameter | Description |
|---|---|
provider | Object storage provider. For GCS, this is fixed at GCP |
s3.endpoint | GCS S3-compatible access endpoint |
s3.region | Region where the GCS bucket is located |
s3.access_key | GCS access key ID |
s3.secret_key | GCS secret access key |
timeout | Import timeout, in seconds |
SELECT * FROM test_s3load;
Expected result:
mysql> select * from test_s3load; +---------+-----------+------+ | user_id | name | age | +---------+-----------+------+ | 5 | Ava | 17 | | 10 | Liam | 64 | | 7 | Sophia | 32 | | 9 | Emma | 37 | | 1 | Emily | 25 | | 4 | Alexander | 60 | | 2 | Benjamin | 35 | | 3 | Olivia | 28 | | 6 | William | 69 | | 8 | James | 64 | +---------+-----------+------+ 10 rows in set (0.04 sec)
The S3 table function (TVF) is suitable for synchronous imports and ad-hoc queries, and works directly with INSERT INTO ... SELECT.
Create a CSV file s3load_example.csv and upload it to Google Cloud Storage. The file contents are:
1,Emily,25 2,Benjamin,35 3,Olivia,28 4,Alexander,60 5,Ava,17 6,William,69 7,Sophia,32 8,James,64 9,Emma,37 10,Liam,64
CREATE TABLE test_s3load( user_id BIGINT NOT NULL COMMENT "user id", name VARCHAR(20) COMMENT "name", age INT COMMENT "age" ) DUPLICATE KEY(user_id) DISTRIBUTED BY HASH(user_id) BUCKETS 10;
INSERT INTO test_s3load SELECT * FROM S3 ( "uri" = "s3://your_bucket_name/s3load_example.csv", "format" = "csv", "provider" = "GCP", "s3.endpoint" = "storage.us-west2.rep.googleapis.com", "s3.region" = "US-WEST2", "s3.access_key" = "<your-ak>", "s3.secret_key" = "<your-sk>", "column_separator" = ",", "csv_schema" = "user_id:int;name:string;age:int" );
Key parameters:
| Parameter | Description |
|---|---|
uri | S3 URI of the file in object storage |
format | File format, such as csv, parquet, or orc |
provider | Object storage provider. For GCS, this is fixed at GCP |
s3.endpoint | GCS S3-compatible access endpoint |
s3.region | Region where the GCS bucket is located |
s3.access_key | GCS access key ID |
s3.secret_key | GCS secret access key |
column_separator | Column separator |
csv_schema | CSV column definitions, in the format column_name:type;column_name:type |
SELECT * FROM test_s3load;
Expected result:
mysql> select * from test_s3load; +---------+-----------+------+ | user_id | name | age | +---------+-----------+------+ | 5 | Ava | 17 | | 10 | Liam | 64 | | 7 | Sophia | 32 | | 9 | Emma | 37 | | 1 | Emily | 25 | | 4 | Alexander | 60 | | 2 | Benjamin | 35 | | 3 | Olivia | 28 | | 6 | William | 69 | | 8 | James | 64 | +---------+-----------+------+ 10 rows in set (0.04 sec)
Q1: What value should the provider parameter use?
When importing from Google Cloud Storage, provider must be set to GCP so that Doris accesses GCS through its S3-compatible protocol.
Q2: How do I choose between S3 Load and S3 TVF?
Q3: How do I obtain the Access Key and Secret Key for GCS?
You can create and manage HMAC keys (Access Key / Secret Key) for S3-compatible access on the Cloud Storage > Settings > Interoperability page in the Google Cloud Console.
Q4: How do I determine the Endpoint and Region?
The Endpoint and Region depend on the region where the bucket is located. For example, the US-WEST2 region corresponds to the endpoint storage.us-west2.rep.googleapis.com. Replace these values according to the actual location of your bucket.
Q5: What should I do if the import fails with a timeout?
Increase the timeout parameter in PROPERTIES (in seconds). The default is 3600. For very large files or slow networks, extend it as needed.