This document describes how to migrate data from Google BigQuery into Apache Doris. The overall workflow uses object storage as an intermediate medium:
Before starting the migration, pay attention to the following items to avoid common issues:
Before migrating, create columns in Doris that correspond to the BigQuery fields according to the following rules:
| BigQuery | Doris | Notes |
|---|---|---|
| Array | Array | |
| BOOLEAN | BOOLEAN | |
| DATE | DATE | |
| DATETIME/TIMESTAMP | DATETIME | |
| JSON | JSON | |
| INT64 | BIGINT | |
| NUMERIC | DECIMAL | |
| FLOAT64 | DOUBLE | |
| STRING | VARCHAR/STRING | VARCHAR maximum length is 65535 |
| STRUCT | STRUCT | |
| TIME | STRING | Cast to String before exporting |
| OTHER | UNSUPPORTED | Types not currently supported |
Before migrating a BigQuery table to Doris, first create a Doris table with a matching schema.
Assume the following table and sample data already exist in BigQuery:
CREATE OR REPLACE TABLE test.sales_data ( order_id INT64, customer_name STRING, order_date DATE, amount NUMERIC(10,2), country STRING ) PARTITION BY order_date; INSERT INTO test.sales_data (order_id, customer_name, order_date, amount, country) VALUES (1, 'Alice', '2025-04-08', 99.99, 'USA'), (2, 'Bob', '2025-04-08', 149.50, 'Canada'), (3, 'Charlie', '2025-04-09', 75.00, 'UK'), (4, 'Diana', '2025-04-10', 200.00, 'Australia');
Based on this schema, you can create a unique key partitioned table in Doris. The partition column matches the BigQuery partition column, and the table is partitioned by day:
CREATE TABLE `sales_data` ( order_id INT, order_date DATE NOT NULL, customer_name VARCHAR(128), amount DECIMAL(10,2), country VARCHAR(48) ) ENGINE=OLAP UNIQUE KEY(`order_id`,`order_date`) PARTITION BY RANGE(`order_date`) ( PARTITION p20250408 VALUES [('2025-04-08'), ('2025-04-09')), PARTITION p20250409 VALUES [('2025-04-09'), ('2025-04-10')), PARTITION p20250410 VALUES [('2025-04-10'), ('2025-04-11')) ) DISTRIBUTED BY HASH(`order_id`) BUCKETS 16 PROPERTIES ( "dynamic_partition.enable" = "true", "dynamic_partition.time_unit" = "DAY", "dynamic_partition.end" = "5", "dynamic_partition.prefix" = "p", "dynamic_partition.buckets" = "16", "replication_num" = "1" );
Use BigQuery's EXPORT DATA to export the target table to GCS:
EXPORT DATA OPTIONS ( uri = 'gs://mybucket/export/sales_data/*.parquet', format = 'PARQUET') AS ( SELECT * FROM test.sales_data );
The command above exports sales_data to GCS. Each partition produces one or more files with incrementing file names. For details, see exporting-data.
The import uses S3 Load. S3 Load is an asynchronous data import method. After the job is submitted, Doris actively pulls data from the data source. The data source supports object storage compatible with the S3 protocol, including AWS S3, GCS, and Azure.
This method is suitable for scenarios with large data volumes that need asynchronous background processing. For data imports that need synchronous processing, see TVF Load.
Note: For Parquet/ORC files containing complex types (Struct/Array/Map), you must currently use TVF Load.
Use the following LOAD LABEL statement to import a single Parquet file from GCS:
LOAD LABEL sales_data_2025_04_08 ( DATA INFILE("s3://mybucket/export/sales_data/000000000000.parquet") INTO TABLE sales_data FORMAT AS "parquet" (order_id, order_date, customer_name, amount, country) ) WITH S3 ( "provider" = "GCP", "s3.endpoint" = "storage.asia-southeast1.rep.googleapis.com", "s3.region" = "asia-southeast1", "s3.access_key" = "<ak>", "s3.secret_key" = "<sk>" );
Because S3 Load is submitted asynchronously, use SHOW LOAD to query the import status for a given label:
mysql> show load where label = "sales_data_2025_04_08"\G *************************** 1. row *************************** JobId: 17956078 Label: sales_data_2025_04_08 State: FINISHED Progress: 100.00% (1/1) Type: BROKER EtlInfo: unselected.rows=0; dpp.abnorm.ALL=0; dpp.norm.ALL=2 TaskInfo: cluster:storage.asia-southeast1.rep.googleapis.com; timeout(s):3600; max_filter_ratio:0.0; priority:NORMAL ErrorMsg: NULL CreateTime: 2025-04-10 17:50:53 EtlStartTime: 2025-04-10 17:50:54 EtlFinishTime: 2025-04-10 17:50:54 LoadStartTime: 2025-04-10 17:50:54 LoadFinishTime: 2025-04-10 17:50:54 URL: NULL JobDetails: {"Unfinished backends":{"5eec1be8612d4872-91040ff1e7208a4f":[]},"ScannedRows":2,"TaskNumber":1,"LoadBytes":91,"All backends":{"5eec1be8612d4872-91040ff1e7208a4f":[10022]},"FileNumber":1,"FileSize":1620} TransactionId: 766228 ErrorTablets: {} User: root Comment: 1 row in set (0.00 sec)
When multiple import jobs exist, use the following statement to query failed imports and their causes:
mysql> show load where state='CANCELLED' and label like "label_test%"\G *************************** 1. row *************************** JobId: 18312384 Label: label_test123 State: CANCELLED Progress: 100.00% (3/3) Type: BROKER EtlInfo: unselected.rows=0; dpp.abnorm.ALL=4; dpp.norm.ALL=0 TaskInfo: cluster:storage.asia-southeast1.rep.googleapis.com; timeout(s):14400; max_filter_ratio:0.0; priority:NORMAL ErrorMsg: type:ETL_QUALITY_UNSATISFIED; msg:quality not good enough to cancel CreateTime: 2025-04-15 17:32:59 EtlStartTime: 2025-04-15 17:33:02 EtlFinishTime: 2025-04-15 17:33:02 LoadStartTime: 2025-04-15 17:33:02 LoadFinishTime: 2025-04-15 17:33:02 URL: http://10.16.10.6:28747/api/_load_error_log?file=__shard_2/error_log_insert_stmt_7602ccd7c3a4854-95307efca7bfe342_7602ccd7c3a4854_95307efca7bfe342 JobDetails: {"Unfinished backends":{"7602ccd7c3a4854-95307efca7bfe341":[]},"ScannedRows":4,"TaskNumber":1,"LoadBytes":188,"All backends":{"7602ccd7c3a4854-95307efca7bfe341":[10022]},"FileNumber":3,"FileSize":4839} TransactionId: 769213 ErrorTablets: {} User: root Comment:
The example above is a data quality error (ETL_QUALITY_UNSATISFIED). To get the specific cause, visit the returned URL. For example, the error below indicates that the data exceeds the actual length of the country column in the table schema:
[root@VM-10-6-centos ~]$ curl "http://10.16.10.6:28747/api/_load_error_log?file=__shard_2/error_log_insert_stmt_7602ccd7c3a4854-95307efca7bfe342_7602ccd7c3a4854_95307efca7bfe342" Reason: column_name[country], the length of input is too long than schema. first 32 bytes of input str: [USA] schema length: 1; actual length: 3; . src line []; Reason: column_name[country], the length of input is too long than schema. first 32 bytes of input str: [Canada] schema length: 1; actual length: 6; . src line []; Reason: column_name[country], the length of input is too long than schema. first 32 bytes of input str: [UK] schema length: 1; actual length: 2; . src line []; Reason: column_name[country], the length of input is too long than schema. first 32 bytes of input str: [Australia] schema length: 1; actual length: 9; . src line [];
For data quality errors, if skipping some erroneous rows is acceptable, set the error tolerance ratio in the S3 Load PROPERTIES. For details on the parameters, see Load configuration parameters.
When migrating large volumes of historical data, use a batched import strategy:
Q1: Why can JSON types in BigQuery not be exported directly as Parquet?
BigQuery does not currently support exporting JSON types in Parquet format. Set format to JSON for the export, and use a corresponding JSON column in Doris to receive the data.
Q2: How does the BigQuery Time type map to Doris?
The BigQuery Time type must be cast to String during export, and stored in a STRING column in Doris.
Q3: What should I do when an import fails with ETL_QUALITY_UNSATISFIED?
This error indicates that the data does not meet quality requirements. Handle it as follows:
URL field in the SHOW LOAD result and use curl to fetch the detailed error log.max_filter_ratio in the S3 Load PROPERTIES to skip some erroneous rows.Q4: Can Parquet/ORC files containing Struct/Array/Map be imported with S3 Load?
No. Files with these complex types must currently be imported with TVF Load.
Q5: How can I avoid the retry cost caused by single-job failures when migrating large volumes of historical data?
Import in batches by partition, with no more than 100 GB per batch, and use the s3_load_file_demo.sh script to split and submit batches.