docs/data-operate/import/data-source/bigquery.md

{ “title”: “BigQuery”, “language”: “en”, “description”: “A complete guide to migrating BigQuery data into Apache Doris: staging through GCS, importing with S3 Load, including data type mapping, table creation, troubleshooting, and batch import practices.”, “keywords”: [ “BigQuery migration to Doris”, “BigQuery import to Apache Doris”, “GCS S3 Load”, “BigQuery Export Parquet”, “Doris data migration”, “ETL_QUALITY_UNSATISFIED” ] }

This document describes how to migrate data from Google BigQuery into Apache Doris. The overall workflow uses object storage as an intermediate medium:

Use BigQuery's Export statement to export data to GCS (Google Cloud Storage).
Create the corresponding target table in Doris.
Use Doris's S3 Load feature to pull the data from GCS into Doris. For details on the mechanism, see S3 Load.

Notes

Before starting the migration, pay attention to the following items to avoid common issues:

Table schema design: Choose the appropriate Doris data model, as well as partitioning and bucketing strategies based on the BigQuery table schema. For more table creation strategies, see Load best practices.
JSON type export: BigQuery does not support exporting JSON types in Parquet format. Use JSON format for the export instead.
Time type export: When BigQuery exports a Time type, cast it to String before exporting.
Complex type import: For Parquet/ORC files containing complex types (Struct/Array/Map), you must currently use TVF Load.

Data type mapping

Before migrating, create columns in Doris that correspond to the BigQuery fields according to the following rules:

BigQuery	Doris	Notes
Array	Array
BOOLEAN	BOOLEAN
DATE	DATE
DATETIME/TIMESTAMP	DATETIME
JSON	JSON
INT64	BIGINT
NUMERIC	DECIMAL
FLOAT64	DOUBLE
STRING	VARCHAR/STRING	VARCHAR maximum length is 65535
STRUCT	STRUCT
TIME	STRING	Cast to String before exporting
OTHER	UNSUPPORTED	Types not currently supported

Migration steps

Step 1: Create the target table in Doris

Before migrating a BigQuery table to Doris, first create a Doris table with a matching schema.

Assume the following table and sample data already exist in BigQuery:

CREATE OR REPLACE TABLE test.sales_data (
    order_id      INT64,
    customer_name STRING,
    order_date    DATE,
    amount        NUMERIC(10,2),
    country       STRING
)
PARTITION BY  order_date;


INSERT INTO test.sales_data (order_id, customer_name, order_date, amount, country) VALUES
(1, 'Alice', '2025-04-08', 99.99, 'USA'),
(2, 'Bob', '2025-04-08', 149.50, 'Canada'),
(3, 'Charlie', '2025-04-09', 75.00, 'UK'),
(4, 'Diana', '2025-04-10', 200.00, 'Australia');

Based on this schema, you can create a unique key partitioned table in Doris. The partition column matches the BigQuery partition column, and the table is partitioned by day:

CREATE TABLE `sales_data` (
  order_id      INT,
  order_date    DATE NOT NULL,
  customer_name VARCHAR(128),
  amount        DECIMAL(10,2),
  country       VARCHAR(48)
) ENGINE=OLAP
UNIQUE KEY(`order_id`,`order_date`)
PARTITION BY RANGE(`order_date`) (
PARTITION p20250408 VALUES [('2025-04-08'), ('2025-04-09')),
PARTITION p20250409 VALUES [('2025-04-09'), ('2025-04-10')),
PARTITION p20250410 VALUES [('2025-04-10'), ('2025-04-11'))
)
DISTRIBUTED BY HASH(`order_id`) BUCKETS 16
PROPERTIES (
 "dynamic_partition.enable" = "true",
 "dynamic_partition.time_unit" = "DAY",
 "dynamic_partition.end" = "5",
 "dynamic_partition.prefix" = "p",
 "dynamic_partition.buckets" = "16",
 "replication_num" = "1"
);

Step 2: Export BigQuery data to GCS

2.1 Export Parquet files with the Export statement

Use BigQuery's EXPORT DATA to export the target table to GCS:

EXPORT DATA
  OPTIONS (
    uri = 'gs://mybucket/export/sales_data/*.parquet',
    format = 'PARQUET')
AS (
  SELECT *
  FROM test.sales_data 
);

2.2 Inspect the exported files on GCS

The command above exports sales_data to GCS. Each partition produces one or more files with incrementing file names. For details, see exporting-data.

gcs_export

Step 3: Import data into Doris with S3 Load

The import uses S3 Load. S3 Load is an asynchronous data import method. After the job is submitted, Doris actively pulls data from the data source. The data source supports object storage compatible with the S3 protocol, including AWS S3, GCS, and Azure.

This method is suitable for scenarios with large data volumes that need asynchronous background processing. For data imports that need synchronous processing, see TVF Load.

Note: For Parquet/ORC files containing complex types (Struct/Array/Map), you must currently use TVF Load.

3.1 Import data from a single file

Use the following LOAD LABEL statement to import a single Parquet file from GCS:

LOAD LABEL sales_data_2025_04_08
(
    DATA INFILE("s3://mybucket/export/sales_data/000000000000.parquet")
    INTO TABLE sales_data
    FORMAT AS "parquet"
    (order_id, order_date, customer_name, amount, country)
)
WITH S3
(
    "provider" = "GCP",
    "s3.endpoint" = "storage.asia-southeast1.rep.googleapis.com",  
    "s3.region" = "asia-southeast1",
    "s3.access_key" = "<ak>",
    "s3.secret_key" = "<sk>"
);

3.2 Check job status with Show Load

Because S3 Load is submitted asynchronously, use SHOW LOAD to query the import status for a given label:

mysql> show load where label = "sales_data_2025_04_08"\G
*************************** 1. row ***************************
        JobId: 17956078
        Label: sales_data_2025_04_08
        State: FINISHED
      Progress: 100.00% (1/1)
          Type: BROKER
      EtlInfo: unselected.rows=0; dpp.abnorm.ALL=0; dpp.norm.ALL=2
      TaskInfo: cluster:storage.asia-southeast1.rep.googleapis.com; timeout(s):3600; max_filter_ratio:0.0; priority:NORMAL
      ErrorMsg: NULL
    CreateTime: 2025-04-10 17:50:53
  EtlStartTime: 2025-04-10 17:50:54
EtlFinishTime: 2025-04-10 17:50:54
LoadStartTime: 2025-04-10 17:50:54
LoadFinishTime: 2025-04-10 17:50:54
          URL: NULL
    JobDetails: {"Unfinished backends":{"5eec1be8612d4872-91040ff1e7208a4f":[]},"ScannedRows":2,"TaskNumber":1,"LoadBytes":91,"All backends":{"5eec1be8612d4872-91040ff1e7208a4f":[10022]},"FileNumber":1,"FileSize":1620}
TransactionId: 766228
  ErrorTablets: {}
          User: root
      Comment: 
1 row in set (0.00 sec)

3.3 Handle errors during import

When multiple import jobs exist, use the following statement to query failed imports and their causes:

mysql> show load where state='CANCELLED' and label like "label_test%"\G
*************************** 1. row ***************************
        JobId: 18312384
        Label: label_test123
        State: CANCELLED
      Progress: 100.00% (3/3)
          Type: BROKER
      EtlInfo: unselected.rows=0; dpp.abnorm.ALL=4; dpp.norm.ALL=0
      TaskInfo: cluster:storage.asia-southeast1.rep.googleapis.com; timeout(s):14400; max_filter_ratio:0.0; priority:NORMAL
      ErrorMsg: type:ETL_QUALITY_UNSATISFIED; msg:quality not good enough to cancel
    CreateTime: 2025-04-15 17:32:59
  EtlStartTime: 2025-04-15 17:33:02
EtlFinishTime: 2025-04-15 17:33:02
LoadStartTime: 2025-04-15 17:33:02
LoadFinishTime: 2025-04-15 17:33:02
          URL: http://10.16.10.6:28747/api/_load_error_log?file=__shard_2/error_log_insert_stmt_7602ccd7c3a4854-95307efca7bfe342_7602ccd7c3a4854_95307efca7bfe342
    JobDetails: {"Unfinished backends":{"7602ccd7c3a4854-95307efca7bfe341":[]},"ScannedRows":4,"TaskNumber":1,"LoadBytes":188,"All backends":{"7602ccd7c3a4854-95307efca7bfe341":[10022]},"FileNumber":3,"FileSize":4839}
TransactionId: 769213
  ErrorTablets: {}
          User: root
      Comment:

The example above is a data quality error (ETL_QUALITY_UNSATISFIED). To get the specific cause, visit the returned URL. For example, the error below indicates that the data exceeds the actual length of the country column in the table schema:

[root@VM-10-6-centos ~]$ curl "http://10.16.10.6:28747/api/_load_error_log?file=__shard_2/error_log_insert_stmt_7602ccd7c3a4854-95307efca7bfe342_7602ccd7c3a4854_95307efca7bfe342"
Reason: column_name[country], the length of input is too long than schema. first 32 bytes of input str: [USA] schema length: 1; actual length: 3; . src line []; 
Reason: column_name[country], the length of input is too long than schema. first 32 bytes of input str: [Canada] schema length: 1; actual length: 6; . src line []; 
Reason: column_name[country], the length of input is too long than schema. first 32 bytes of input str: [UK] schema length: 1; actual length: 2; . src line []; 
Reason: column_name[country], the length of input is too long than schema. first 32 bytes of input str: [Australia] schema length: 1; actual length: 9; . src line [];

For data quality errors, if skipping some erroneous rows is acceptable, set the error tolerance ratio in the S3 Load PROPERTIES. For details on the parameters, see Load configuration parameters.

3.4 Import data from multiple files

When migrating large volumes of historical data, use a batched import strategy:

Batch division: Each batch corresponds to one Doris partition or a small number of partitions.
Per-batch data volume: No more than 100 GB per batch is recommended, to reduce system pressure and lower the cost of retries after a failed import.
Batch script: See s3_load_file_demo.sh. This script splits the file list under a specified directory in object storage and submits multiple S3 Load jobs to Doris in batches, enabling batch import.

FAQ

Q1: Why can JSON types in BigQuery not be exported directly as Parquet?

BigQuery does not currently support exporting JSON types in Parquet format. Set format to JSON for the export, and use a corresponding JSON column in Doris to receive the data.

Q2: How does the BigQuery Time type map to Doris?

The BigQuery Time type must be cast to String during export, and stored in a STRING column in Doris.

Q3: What should I do when an import fails with ETL_QUALITY_UNSATISFIED?

This error indicates that the data does not meet quality requirements. Handle it as follows:

Look at the URL field in the SHOW LOAD result and use curl to fetch the detailed error log.
Identify the cause from the log (common causes include column length overflow and type mismatch).
Adjust the target table schema, or set max_filter_ratio in the S3 Load PROPERTIES to skip some erroneous rows.

Q4: Can Parquet/ORC files containing Struct/Array/Map be imported with S3 Load?

No. Files with these complex types must currently be imported with TVF Load.

Q5: How can I avoid the retry cost caused by single-job failures when migrating large volumes of historical data?

Import in batches by partition, with no more than 100 GB per batch, and use the s3_load_file_demo.sh script to split and submit batches.