blob: 5ba3ad39708d35a0f0a2d476acb78f5b145562c9 [file] [view]
---
{
"title": "Importing Data from Google Cloud Storage (GCS)",
"language": "en",
"description": "How to import data from Google Cloud Storage into Apache Doris: asynchronous import via S3 Load, or synchronous import via S3 TVF.",
"keywords": [
"Google Cloud Storage import",
"GCS import to Doris",
"S3 Load GCP",
"S3 TVF GCS",
"Doris object storage import",
"GCP provider"
],
"sidebar_label": "Google Cloud Storage"
}
---
<!-- Knowledge type: Procedure -->
<!-- Applicable scenario: Importing data from Google Cloud Storage into Apache Doris -->
Apache Doris supports importing files from Google Cloud Storage (GCS) by accessing GCS buckets through the S3-compatible protocol. This document describes two import methods and their complete operating procedures.
## Choosing an Approach
Doris provides two methods for importing files from Google Cloud Storage. Choose one based on data volume and timeliness requirements:
| Import Method | Execution Mode | Applicable Scenario | Reference |
|---------|---------|---------|---------|
| S3 Load | Asynchronous | Large-batch data imports, long-running jobs | [Broker Load Manual](../import-way/broker-load-manual.md) |
| S3 TVF (table function) | Synchronous | Small-batch data imports, ad-hoc queries, quick validation | This document |
Recommendations:
- For larger data volumes or jobs that need to run in the background, use **S3 Load**.
- For immediate results or use with `INSERT INTO ... SELECT`, use **S3 TVF**.
## Prerequisites
Prepare the following information before importing:
- Google Cloud Storage bucket name (`your_bucket_name`).
- Access credentials: Access Key and Secret Key.
- GCS Endpoint and Region (for example, `storage.us-west2.rep.googleapis.com` and `US-WEST2`).
- A deployed and accessible Apache Doris cluster.
## Method 1: Import with S3 Load (Asynchronous)
S3 Load is suitable for asynchronous imports of large data volumes. For detailed parameters and advanced usage, see the [Broker Load Manual](../import-way/broker-load-manual.md).
### Step 1: Prepare the Data
Create a CSV file `s3load_example.csv` and upload it to Google Cloud Storage. The file contents are:
```text
1,Emily,25
2,Benjamin,35
3,Olivia,28
4,Alexander,60
5,Ava,17
6,William,69
7,Sophia,32
8,James,64
9,Emma,37
10,Liam,64
```
### Step 2: Create a Table in Doris
```sql
CREATE TABLE test_s3load(
user_id BIGINT NOT NULL COMMENT "user id",
name VARCHAR(20) COMMENT "name",
age INT COMMENT "age"
)
DUPLICATE KEY(user_id)
DISTRIBUTED BY HASH(user_id) BUCKETS 10;
```
### Step 3: Import the Data with S3 Load
```sql
LOAD LABEL s3_load_2022_04_01
(
DATA INFILE("s3://your_bucket_name/s3load_example.csv")
INTO TABLE test_s3load
COLUMNS TERMINATED BY ","
FORMAT AS "CSV"
(user_id, name, age)
)
WITH S3
(
"provider" = "GCP",
"s3.endpoint" = "storage.us-west2.rep.googleapis.com",
"s3.region" = "US-WEST2",
"s3.access_key" = "<your-ak>",
"s3.secret_key" = "<your-sk>"
)
PROPERTIES
(
"timeout" = "3600"
);
```
Key parameters:
| Parameter | Description |
|------|------|
| `provider` | Object storage provider. For GCS, this is fixed at `GCP` |
| `s3.endpoint` | GCS S3-compatible access endpoint |
| `s3.region` | Region where the GCS bucket is located |
| `s3.access_key` | GCS access key ID |
| `s3.secret_key` | GCS secret access key |
| `timeout` | Import timeout, in seconds |
### Step 4: Verify the Imported Data
```sql
SELECT * FROM test_s3load;
```
Expected result:
```text
mysql> select * from test_s3load;
+---------+-----------+------+
| user_id | name | age |
+---------+-----------+------+
| 5 | Ava | 17 |
| 10 | Liam | 64 |
| 7 | Sophia | 32 |
| 9 | Emma | 37 |
| 1 | Emily | 25 |
| 4 | Alexander | 60 |
| 2 | Benjamin | 35 |
| 3 | Olivia | 28 |
| 6 | William | 69 |
| 8 | James | 64 |
+---------+-----------+------+
10 rows in set (0.04 sec)
```
## Method 2: Import with TVF (Synchronous)
The S3 table function (TVF) is suitable for synchronous imports and ad-hoc queries, and works directly with `INSERT INTO ... SELECT`.
### Step 1: Prepare the Data
Create a CSV file `s3load_example.csv` and upload it to Google Cloud Storage. The file contents are:
```text
1,Emily,25
2,Benjamin,35
3,Olivia,28
4,Alexander,60
5,Ava,17
6,William,69
7,Sophia,32
8,James,64
9,Emma,37
10,Liam,64
```
### Step 2: Create a Table in Doris
```sql
CREATE TABLE test_s3load(
user_id BIGINT NOT NULL COMMENT "user id",
name VARCHAR(20) COMMENT "name",
age INT COMMENT "age"
)
DUPLICATE KEY(user_id)
DISTRIBUTED BY HASH(user_id) BUCKETS 10;
```
### Step 3: Import the Data with TVF
```sql
INSERT INTO test_s3load
SELECT * FROM S3
(
"uri" = "s3://your_bucket_name/s3load_example.csv",
"format" = "csv",
"provider" = "GCP",
"s3.endpoint" = "storage.us-west2.rep.googleapis.com",
"s3.region" = "US-WEST2",
"s3.access_key" = "<your-ak>",
"s3.secret_key" = "<your-sk>",
"column_separator" = ",",
"csv_schema" = "user_id:int;name:string;age:int"
);
```
Key parameters:
| Parameter | Description |
|------|------|
| `uri` | S3 URI of the file in object storage |
| `format` | File format, such as `csv`, `parquet`, or `orc` |
| `provider` | Object storage provider. For GCS, this is fixed at `GCP` |
| `s3.endpoint` | GCS S3-compatible access endpoint |
| `s3.region` | Region where the GCS bucket is located |
| `s3.access_key` | GCS access key ID |
| `s3.secret_key` | GCS secret access key |
| `column_separator` | Column separator |
| `csv_schema` | CSV column definitions, in the format `column_name:type;column_name:type` |
### Step 4: Verify the Imported Data
```sql
SELECT * FROM test_s3load;
```
Expected result:
```text
mysql> select * from test_s3load;
+---------+-----------+------+
| user_id | name | age |
+---------+-----------+------+
| 5 | Ava | 17 |
| 10 | Liam | 64 |
| 7 | Sophia | 32 |
| 9 | Emma | 37 |
| 1 | Emily | 25 |
| 4 | Alexander | 60 |
| 2 | Benjamin | 35 |
| 3 | Olivia | 28 |
| 6 | William | 69 |
| 8 | James | 64 |
+---------+-----------+------+
10 rows in set (0.04 sec)
```
## FAQ
<!-- Knowledge type: Troubleshooting -->
**Q1: What value should the `provider` parameter use?**
When importing from Google Cloud Storage, `provider` must be set to `GCP` so that Doris accesses GCS through its S3-compatible protocol.
**Q2: How do I choose between S3 Load and S3 TVF?**
- For large data volumes that need to run asynchronously in the background, choose **S3 Load**.
- For small data volumes that need immediate results or that work alongside SQL queries, choose **S3 TVF**.
**Q3: How do I obtain the Access Key and Secret Key for GCS?**
You can create and manage HMAC keys (Access Key / Secret Key) for S3-compatible access on the **Cloud Storage > Settings > Interoperability** page in the Google Cloud Console.
**Q4: How do I determine the Endpoint and Region?**
The Endpoint and Region depend on the region where the bucket is located. For example, the `US-WEST2` region corresponds to the endpoint `storage.us-west2.rep.googleapis.com`. Replace these values according to the actual location of your bucket.
**Q5: What should I do if the import fails with a timeout?**
Increase the `timeout` parameter in `PROPERTIES` (in seconds). The default is `3600`. For very large files or slow networks, extend it as needed.