Apache Doris is compatible with the MySQL protocol and supports the standard MySQL LOAD DATA syntax for importing local files. MySQL Load is a synchronous import method: the import result is returned when the job completes, so users can tell from the returned result whether the import succeeds. Generally, MySQL Load is suitable for files smaller than 10 GB; for larger files, it is recommended to split them into smaller ones. MySQL Load ensures the atomicity of a batch of import tasks, meaning that either all imports succeed or all imports fail.
## Supported format
MySQL Load is primarily designed for importing CSV files from the client's local machine or importing data from data streams through programs.
## Restrictions
When importing CSV files, it is important to differentiate between null values and empty strings (''):
- `a,\N,b`: the middle column represents a null value.
- `a, ,b`: the middle column represents an empty string.

MySQL Load is similar to Stream Load in terms of functionality: both import local files into the Doris cluster. Therefore, the implementation of MySQL Load reuses the basic import capabilities of Stream Load.
In a MySQL Load job, the client submits the LOAD DATA statement and the local file to the FE over the MySQL protocol, and the FE then writes the data into Doris by reusing the Stream Load import process.
MySQL Load requires INSERT permission on the target table. You can grant permissions to the user account using the GRANT command.
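For example, the following is a minimal sketch that grants the load privilege on the test table used below; the account name `user1` and host `%` are placeholder assumptions:

```sql
-- Hypothetical account: replace 'user1'@'%' with your own user and host.
-- LOAD_PRIV is the Doris privilege covering data-import (INSERT-style) operations.
GRANT LOAD_PRIV ON testdb.t1 TO 'user1'@'%';
```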
Create a data file client_local.csv containing the following sample data:
```
1,10
2,20
3,30
4,40
5,50
6,60
```
Connect to Doris via the MySQL client before executing the LOAD DATA command:

```shell
mysql --local-infile -h <fe_ip> -P <fe_query_port> -u root -D testdb
```
:::caution
Specific options are required when establishing the connection:

- When connecting with the MySQL client, `--local-infile` must be included, otherwise an error might be thrown.
- When connecting via JDBC, `allowLoadLocalInfile=true` must be specified in the URL (see the example below).
:::
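For reference, a JDBC URL with this option might look like the following; the host, port, and database are placeholders matching the quick start:

```
jdbc:mysql://<fe_ip>:<fe_query_port>/testdb?allowLoadLocalInfile=true
```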
## Create the test table
Create a table as follows in Doris:
```sql
CREATE TABLE testdb.t1 (
    pk INT,
    v1 INT SUM
)
AGGREGATE KEY (pk)
DISTRIBUTED BY HASH (pk);
```
After connecting to the MySQL client, create a Load job. The command is as follows:
```sql
LOAD DATA LOCAL
INFILE 'client_local.csv'
INTO TABLE testdb.t1
COLUMNS TERMINATED BY ','
LINES TERMINATED BY '\n';
```
MySQL Load is a synchronous import method, and the results of the import are returned to the user in the command-line interface. If the import execution fails, it will display specific error messages.
Below is an example of a successful import result, which returns the number of imported rows:
```
Query OK, 6 row affected (0.17 sec)
Records: 6  Deleted: 0  Skipped: 0  Warnings: 0
```
When there are exceptions during the import, the corresponding error will be displayed on the client:
```
ERROR 1105 (HY000): errCode = 2, detailMessage = [DATA_QUALITY_ERROR]too many filtered rows with load id b612907c-ccf4-4ac2-82fe-107ece655f0f
```
The load id is included in the error message; based on it, you can view the detailed information via the SHOW LOAD WARNINGS command:
```sql
show load warnings where label='b612907c-ccf4-4ac2-82fe-107ece655f0f';
```
Doris does not allow manual cancellation of MySQL Load jobs. In the event of a timeout or import error, the corresponding MySQL Load job will be automatically cancelled by the system.
The syntax for LOAD DATA is as follows:
```sql
LOAD DATA LOCAL
INFILE '<load_data_file>'
INTO TABLE [<db_name>.]<table_name>
[PARTITION (partition_name [, partition_name] ...)]
[COLUMNS TERMINATED BY '<column_terminated_operator>']
[LINES TERMINATED BY '<line_terminated_operator>']
[IGNORE <ignore_lines> LINES]
[(col_name_or_user_var [, col_name_or_user_var] ...)]
[SET col_name={expr | DEFAULT} [, col_name={expr | DEFAULT}] ...]
[PROPERTIES (key1 = value1 [, key2 = value2])]
```
Descriptions of modules in the Load job:
| Module | Description |
|---|---|
| INFILE | This specifies the local file path, which can be either a relative path or an absolute path. Currently, `load_data_file` only supports a single file. |
| INTO TABLE | This specifies the database and table, and the database name can be omitted. |
| PARTITION | This specifies the target partition. If the user can determine the partition corresponding to the data, it is recommended to specify this. Data that does not fit into the specified partitions will be filtered out. |
| COLUMNS TERMINATED BY | This specifies the column delimiter. |
| LINES TERMINATED BY | This specifies the row delimiter. |
| IGNORE num LINES | This specifies the number of rows to skip in the CSV import, typically specified as 1 to skip the header. |
| col_name_or_user_var | This specifies the column mapping syntax. For more information, refer to Column Mapping. |
| PROPERTIES | Parameters for the Load. |
You can configure parameters for the Load job via the `PROPERTIES (key1 = value1 [, key2 = value2])` clause. The available parameters are listed below, followed by a combined example.
| Parameter | Description |
|---|---|
| max_filter_ratio | The maximum filtering rate allowed. Must be between 0 and 1, inclusive. The default value is 0, indicating no tolerance for any error rows. |
| timeout | The import timeout, measured in seconds. The default value is 600. The range allowed is from 1s to 259200s. |
| strict_mode | Whether to enable strict mode for this import. Disabled by default. |
| timezone | The time zone for this import. The default time zone is UTC+8. This parameter will affect the results of any time zone-related functions involved in the import. |
| exec_mem_limit | Memory limit for the import, defaults to 2GB, measured in bytes. |
| trim_double_quotes | Boolean, defaults to false. If this is set to true, the outermost double quotes will be trimmed from each field in the import file. |
| enclose | This specifies the enclosure character. When a CSV data field contains line breaks or column delimiters, you can specify a single-byte character as the enclosure character to prevent accidental truncation. For example, if the column delimiter is `,` and the enclosure character is `'`, then in the data `a,'b,c'`, `b,c` will be parsed as one field. |
| escape | This specifies the escape character. It is used when the data contains a character identical to the enclosure character, which should be treated as part of the field. For example, if the data is `a,'b,'c'`, the enclosure character is `'`, and you want `b,'c` to be parsed as one field, you need to specify a single-byte escape character, such as `\`, and modify the data to `a,'b,\'c'`. |
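For instance, several of these parameters can be combined in a single PROPERTIES clause. The sketch below reuses the quick-start table and file; the concrete values are arbitrary:

```sql
-- Sketch: combine multiple load parameters in one PROPERTIES clause.
-- The values (timeout 1200s, 10% filter ratio) are arbitrary examples.
LOAD DATA LOCAL
INFILE 'client_local.csv'
INTO TABLE testdb.t1
COLUMNS TERMINATED BY ','
LINES TERMINATED BY '\n'
PROPERTIES ("timeout" = "1200", "max_filter_ratio" = "0.1", "strict_mode" = "false");
```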
You can adjust the import timeout by specifying timeout in PROPERTIES. For example, set it to 100s:
```sql
LOAD DATA LOCAL INFILE 'testData' INTO TABLE testDb.testTbl PROPERTIES ("timeout"="100");
```
You can adjust the allowable error rate by specifying max_filter_ratio in PROPERTIES. For example, set it to 20%:
```sql
LOAD DATA LOCAL INFILE 'testData' INTO TABLE testDb.testTbl PROPERTIES ("max_filter_ratio"="0.2");
```
The following example adjusts the column mapping: the columns in the CSV file correspond to the table columns k2, k1, and v1, in that order.
```sql
LOAD DATA LOCAL INFILE 'testData' INTO TABLE testDb.testTbl (k2, k1, v1);
```
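Columns can also be transformed during the mapping via the SET clause in the syntax above. The following is a sketch that assumes a user variable named `@tmp_v1` (the `col_name_or_user_var` form): it reads the second file column into the variable and multiplies it by 100 before writing it to v1:

```sql
-- Sketch: read the second CSV column into an assumed user variable @tmp_v1,
-- then apply an expression before storing the result in v1.
LOAD DATA LOCAL
INFILE 'testData'
INTO TABLE testDb.testTbl
(k1, @tmp_v1)
SET v1 = @tmp_v1 * 100;
```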
You can specify the column and row delimiters using the COLUMNS TERMINATED BY and LINES TERMINATED BY clauses. In the following example, (,) and (\n) are used as the column and row delimiters, respectively.
```sql
LOAD DATA LOCAL INFILE 'testData' INTO TABLE testDb.testTbl COLUMNS TERMINATED BY ',' LINES TERMINATED BY '\n';
```
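If the CSV file contains a header row, you can skip it with the IGNORE clause described earlier. The following sketch skips the first line of the file:

```sql
-- Sketch: skip the first line (e.g. a CSV header) with IGNORE ... LINES.
LOAD DATA LOCAL
INFILE 'testData'
INTO TABLE testDb.testTbl
COLUMNS TERMINATED BY ','
IGNORE 1 LINES;
```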
You can specify the target partition for the import using the PARTITION clause. In the following example, data will be loaded into the specified partitions p1 and p2, and any data that does not belong to these two partitions will be filtered out:
```sql
LOAD DATA LOCAL INFILE 'testData' INTO TABLE testDb.testTbl PARTITION (p1, p2);
```
You can specify the timezone in PROPERTIES. In the following example, the timezone is set to Africa/Abidjan:
```sql
LOAD DATA LOCAL INFILE 'testData' INTO TABLE testDb.testTbl PROPERTIES ("timezone"="Africa/Abidjan");
```
You can specify the memory limit for the import by the exec_mem_limit parameter in PROPERTIES. In the following example, the memory limit is set to 10G:
```sql
LOAD DATA LOCAL INFILE 'testData' INTO TABLE testDb.testTbl PROPERTIES ("exec_mem_limit"="10737418240");
```
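You can also set the enclosure character described in the parameter table via PROPERTIES. The sketch below mirrors that example: with `,` as the column delimiter and `'` as the enclosure character, a field such as `'b,c'` is parsed as a single value:

```sql
-- Sketch: use ' as the enclosure character so a field like 'b,c',
-- which contains the column delimiter, is parsed as one value.
LOAD DATA LOCAL
INFILE 'testData'
INTO TABLE testDb.testTbl
COLUMNS TERMINATED BY ','
PROPERTIES ("enclose" = "'");
```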
For more detailed syntax and best practices related to MySQL Load, refer to the MySQL Load command manual.