When migrating data from Redshift to Apache Doris, an object storage service (such as Amazon S3) is typically used as an intermediate medium. The overall migration flow is as follows:
Before performing the migration, pay attention to the following points:
TIME type, you must first CAST it to VARCHAR before exporting.Before the migration, map the Redshift data types to Doris data types according to the table below:
| Redshift | Doris | Notes |
|---|---|---|
| SMALLINT | SMALLINT | |
| INTEGER | INT | |
| BIGINT | BIGINT | |
| DECIMAL | DECIMAL | |
| REAL | FLOAT | |
| DOUBLE PRECISION | DOUBLE | |
| BOOLEAN | BOOLEAN | |
| CHAR | CHAR | |
| VARCHAR | VARCHAR/STRING | VARCHAR maximum length is 65535 |
| DATE | DATE | |
| TIMESTAMP | DATETIME | |
| TIME/TIMEZ | STRING | |
| SUPER | VARIANT | |
| OTHER | UNSUPPORTED |
Before the migration, create the corresponding target table in Doris.
Goal: Plan the Doris data model, partitioning, and bucketing strategy based on the Redshift source table schema.
Redshift source table example:
CREATE TABLE sales_data ( order_id INTEGER, customer_name VARCHAR(128), order_date DATE, amount DECIMAL(10,2), country VARCHAR(48) ) DISTSTYLE AUTO; INSERT INTO sales_data VALUES (1, 'Alice', '2025-04-08', 99.99, 'USA'), (2, 'Bob', '2025-04-08', 149.50, 'Canada'), (3, 'Charlie', '2025-04-09', 75.00, 'UK'), (4, 'Diana', '2025-04-10', 200.00, 'Australia');
Doris target table example: Based on the schema above, create a Unique Key partitioned table that uses order_date as the partition column and partitions by day:
CREATE TABLE `sales_data` ( order_id INT, order_date DATE NOT NULL, customer_name VARCHAR(128), amount DECIMAL(10,2), country VARCHAR(48) ) ENGINE=OLAP UNIQUE KEY(`order_id`,`order_date`) PARTITION BY RANGE(`order_date`) ( PARTITION p20250408 VALUES [('2025-04-08'), ('2025-04-09')), PARTITION p20250409 VALUES [('2025-04-09'), ('2025-04-10')), PARTITION p20250410 VALUES [('2025-04-10'), ('2025-04-11')) ) DISTRIBUTED BY HASH(`order_id`) BUCKETS 16 PROPERTIES ( "dynamic_partition.enable" = "true", "dynamic_partition.time_unit" = "DAY", "dynamic_partition.end" = "5", "dynamic_partition.prefix" = "p", "dynamic_partition.buckets" = "16", "replication_num" = "1" );
Goal: Export Redshift data as Parquet files split by the Doris partition column, so that the data can later be loaded partition by partition.
Command:
unload ('select * from sales_data') to 's3://mybucket/redshift/sales_data/' iam_role 'arn:aws:iam::0123456789012:role/MyRedshiftRole' PARQUET PARTITION BY (order_date) INCLUDE
Note: It is recommended to export with the same partition column used in Doris, so that the data can be loaded in batches by partition.
After the export, S3 generates a subdirectory for each partition, and each directory contains the data of one partition:
The load uses S3 Load, an asynchronous data load method in which Doris actively pulls data from the data source after the job is submitted. The data source supports any S3-compatible object storage, including AWS S3, GCS, and AZURE.
Comparison of applicable scenarios:
| Scenario | Recommended method |
|---|---|
| Large data volume, can be processed asynchronously | S3 Load |
| Synchronous data load is required | TVF Load |
| Files contain complex types (Struct/Array/Map) | TVF Load is required |
LOAD LABEL sales_data_2025_04_08 ( DATA INFILE("s3://mybucket/redshift/sales_data/order_date=2025-04-08/*") INTO TABLE sales_data FORMAT AS "parquet" (order_id, order_date, customer_name, amount, country) ) WITH S3 ( "provider" = "S3", "s3.endpoint" = "s3.ap-southeast-1.amazonaws.com", "s3.access_key" = "<ak>", "s3.secret_key"="<sk>", "s3.region" = "ap-southeast-1" );
Because S3 Load is submitted asynchronously, you can use SHOW LOAD with the job label to check load progress:
mysql> show load where label = "sales_data_2025_04_08"\G *************************** 1. row *************************** JobId: 17956078 Label: sales_data_2025_04_08 State: FINISHED Progress: 100.00% (1/1) Type: BROKER EtlInfo: unselected.rows=0; dpp.abnorm.ALL=0; dpp.norm.ALL=2 TaskInfo: cluster:s3.ap-southeast-1.amazonaws.com; timeout(s):3600; max_filter_ratio:0.0; priority:NORMAL ErrorMsg: NULL CreateTime: 2025-04-10 17:50:53 EtlStartTime: 2025-04-10 17:50:54 EtlFinishTime: 2025-04-10 17:50:54 LoadStartTime: 2025-04-10 17:50:54 LoadFinishTime: 2025-04-10 17:50:54 URL: NULL JobDetails: {"Unfinished backends":{"5eec1be8612d4872-91040ff1e7208a4f":[]},"ScannedRows":2,"TaskNumber":1,"LoadBytes":91,"All backends":{"5eec1be8612d4872-91040ff1e7208a4f":[10022]},"FileNumber":1,"FileSize":1620} TransactionId: 766228 ErrorTablets: {} User: root Comment: 1 row in set (0.00 sec)
Step 1: Query failed load jobs. When a batch load contains multiple jobs, you can use the following statement to filter the labels and reasons of the failed jobs:
mysql> show load where state='CANCELLED' and label like "label_test%"\G *************************** 1. row *************************** JobId: 18312384 Label: label_test123 State: CANCELLED Progress: 100.00% (3/3) Type: BROKER EtlInfo: unselected.rows=0; dpp.abnorm.ALL=4; dpp.norm.ALL=0 TaskInfo: cluster:s3.ap-southeast-1.amazonaws.com; timeout(s):14400; max_filter_ratio:0.0; priority:NORMAL ErrorMsg: type:ETL_QUALITY_UNSATISFIED; msg:quality not good enough to cancel CreateTime: 2025-04-15 17:32:59 EtlStartTime: 2025-04-15 17:33:02 EtlFinishTime: 2025-04-15 17:33:02 LoadStartTime: 2025-04-15 17:33:02 LoadFinishTime: 2025-04-15 17:33:02 URL: http://10.16.10.6:28747/api/_load_error_log?file=__shard_2/error_log_insert_stmt_7602ccd7c3a4854-95307efca7bfe342_7602ccd7c3a4854_95307efca7bfe342 JobDetails: {"Unfinished backends":{"7602ccd7c3a4854-95307efca7bfe341":[]},"ScannedRows":4,"TaskNumber":1,"LoadBytes":188,"All backends":{"7602ccd7c3a4854-95307efca7bfe341":[10022]},"FileNumber":3,"FileSize":4839} TransactionId: 769213 ErrorTablets: {} User: root Comment:
Step 2: Interpret the error type. The example above is a data quality error (ETL_QUALITY_UNSATISFIED). Open the URL returned in the output to see the detailed error. For example, the error below indicates that the data exceeds the actual length of the country column defined in the schema:
[root@VM-10-6-centos ~]$ curl "http://10.16.10.6:28747/api/_load_error_log?file=__shard_2/error_log_insert_stmt_7602ccd7c3a4854-95307efca7bfe342_7602ccd7c3a4854_95307efca7bfe342" Reason: column_name[country], the length of input is too long than schema. first 32 bytes of input str: [USA] schema length: 1; actual length: 3; . src line []; Reason: column_name[country], the length of input is too long than schema. first 32 bytes of input str: [Canada] schema length: 1; actual length: 6; . src line []; Reason: column_name[country], the length of input is too long than schema. first 32 bytes of input str: [UK] schema length: 1; actual length: 2; . src line []; Reason: column_name[country], the length of input is too long than schema. first 32 bytes of input str: [Australia] schema length: 1; actual length: 9; . src line [];
Step 3: Set an error tolerance ratio (optional). If you can tolerate skipping some erroneous rows, set the error tolerance ratio in the PROPERTIES of the S3 Load. For details, see Load Configuration Parameters.
When migrating large historical datasets, follow these guidelines for batch loading:
Q1: Why can the Redshift TIME type not be loaded directly into Doris?
Doris does not support the TIME type. When Redshift exports the TIME/TIMEZ type, you must use CAST to convert it to VARCHAR before exporting, and store it as STRING in Doris.
Q2: Can S3 Load be used for Parquet/ORC files that contain complex types such as Struct/Array/Map?
No. Currently, you must use TVF Load for files that contain complex types.
Q3: Is S3 Load synchronous or asynchronous? How can the result be queried?
S3 Load is an asynchronous load method. After submitting the job, run SHOW LOAD WHERE label = "<your_label>" to check the progress and result.
Q4: How should the ETL_QUALITY_UNSATISFIED error be handled?
This error indicates that the data quality does not meet requirements. To handle it:
URL returned in the SHOW LOAD output to retrieve the detailed error log.max_filter_ratio error tolerance ratio in the PROPERTIES of the S3 Load.Q5: How can a one-shot load failure be avoided when migrating large data volumes?
It is recommended to load data in batches by Doris partition, keep each batch under 100 GB, and use the s3_load_demo.sh script to automate batch submission.