The S3 table-valued function (TVF) allows users to read and access files stored on S3-compatible object storage as if they were relational tables. It currently supports the csv/csv_with_names/csv_with_names_and_types/json/parquet/orc file formats.
```sql
S3(
  "uri" = "<uri>",
  "s3.access_key" = "<s3_access_key>",
  "s3.secret_key" = "<s3_secret_key>",
  "s3.region" = "<s3_region>",
  "s3.endpoint" = "<s3_endpoint>",
  "format" = "<format>"
  [, "<optional_property_key>" = "<optional_property_value>" [, ...] ]
)
```
| Parameter | Description |
|---|---|
| uri | URI for accessing S3. The function uses either Path Style or Virtual-hosted Style depending on the `use_path_style` parameter |
| s3.access_key | Access key for S3 |
| s3.secret_key | Secret key for S3 |
| s3.region | S3 region |
| s3.endpoint | S3 endpoint address |
| format | File format; supports csv/csv_with_names/csv_with_names_and_types/json/parquet/orc |
| Parameter | Default | Description | Notes |
|---|---|---|---|
| s3.session_token | - | S3 session token | |
| use_path_style | false | The S3 SDK uses Virtual-hosted Style by default. Some object storage systems may not support it; set this parameter to force Path Style access. For example, MinIO only allows Path Style by default, so set `use_path_style = "true"` when accessing MinIO | |
| force_parsing_by_standard_uri | false | Forces parsing of non-standard URIs as standard URIs | |
| column_separator | \t | Column separator | |
| line_delimiter | \n | Line delimiter | |
| compress_type | UNKNOWN | Compression type. Currently supports UNKNOWN/PLAIN/GZ/LZO/BZ2/LZ4FRAME/DEFLATE/SNAPPYBLOCK; auto-detected from the URI suffix | |
| read_json_by_line | true | For JSON format import | JSON Load |
| strip_outer_array | false | For JSON format import | JSON Load |
| json_root | Empty | For JSON format import | JSON Load |
| jsonpaths | Empty | For JSON format import | JSON Load |
| num_as_string | false | For JSON format import | JSON Load |
| fuzzy_parse | false | For JSON format import | |
| trim_double_quotes | false | For CSV format import; removes the outermost double quotes from each field | |
| skip_lines | 0 | Skips the first N lines of the CSV file. Ineffective when the format is csv_with_names or csv_with_names_and_types | |
| path_partition_keys | - | Specifies partition column names in the file path. For example, for the path /path/to/city=beijing/date="2023-07-09", set path_partition_keys="city,date" to read the column names and values from the path automatically | |
| resource | - | Specifies a Resource name. S3 TVF can use an existing S3 Resource to access S3 directly. For creating an S3 Resource, refer to CREATE-RESOURCE. Supported since version 2.1.4 | |
| enable_mapping_varbinary | false | When reading PARQUET/ORC, the BYTE_ARRAY type is mapped to STRING by default; when enabled, it is mapped to VARBINARY instead | Supported since version 4.0.3 |
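As a concrete illustration of `path_partition_keys`, the following sketch assumes a hypothetical bucket whose objects are laid out as `/path/to/city=beijing/date=2023-07-09/file.csv`; the endpoint, region, and credentials are placeholders:

```sql
-- Minimal sketch of path_partition_keys, assuming a hypothetical bucket laid out as
--   /path/to/city=beijing/date=2023-07-09/file.csv
-- The city and date values are parsed from each file's path and returned as extra columns.
select * from s3(
  "uri" = "https://my-bucket.example-endpoint.com/path/to/*/*/*.csv",
  "s3.access_key" = "ak",
  "s3.secret_key" = "sk",
  "s3.endpoint" = "example-endpoint.com",
  "s3.region" = "region",
  "format" = "csv",
  "path_partition_keys" = "city,date");
```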
For AWS S3, standard URI styles include:

AWS Client Style (Hadoop S3 Style):

```
s3://my-bucket/path/to/file?versionId=abc123&partNumber=77&partNumber=88
```

Virtual-hosted Style:

```
https://my-bucket.s3.us-west-1.amazonaws.com/resources/doc.txt?versionId=abc123&partNumber=77&partNumber=88
```

Path Style:

```
https://s3.us-west-1.amazonaws.com/my-bucket/resources/doc.txt?versionId=abc123&partNumber=77&partNumber=88
```

Besides these three standard URI styles, some other, less common URI styles are also supported:

Virtual Host AWS Client (Hadoop S3) Mixed Style:

```
s3://my-bucket.s3.us-west-1.amazonaws.com/resources/doc.txt?versionId=abc123&partNumber=77&partNumber=88
```

Path AWS Client (Hadoop S3) Mixed Style:

```
s3://s3.us-west-1.amazonaws.com/my-bucket/resources/doc.txt?versionId=abc123&partNumber=77&partNumber=88
```
For detailed usage examples, see the examples below.
To query the TVF directly, or to create a View based on it, USAGE permission on the corresponding Resource is required. Querying a View created on top of the TVF only requires SELECT permission on that View.
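For the Resource-based case, granting access might look like the following sketch; the user name `jack` and the Resource name `s3_resource` are hypothetical placeholders for a user and an S3 Resource created via CREATE RESOURCE:

```sql
-- Hypothetical example: grant USAGE on an existing S3 Resource
-- named 's3_resource' to user 'jack'.
GRANT USAGE_PRIV ON RESOURCE 's3_resource' TO 'jack';
```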
Read and access CSV format files on S3-compatible object storage
```sql
select * from s3(
  "uri" = "http://127.0.0.1:9312/test2/student1.csv",
  "s3.access_key" = "minioadmin",
  "s3.secret_key" = "minioadmin",
  "s3.endpoint" = "http://127.0.0.1:9312",
  "s3.region" = "us-east-1",
  "format" = "csv",
  "use_path_style" = "true")
order by c1;
```
It can also be used with `desc function`:
```sql
desc function s3(
  "uri" = "http://127.0.0.1:9312/test2/student1.csv",
  "s3.access_key" = "minioadmin",
  "s3.secret_key" = "minioadmin",
  "s3.endpoint" = "http://127.0.0.1:9312",
  "s3.region" = "us-east-1",
  "format" = "csv",
  "use_path_style" = "true");
```
Different URL Schema Usage
http://, https:// usage examples:
```sql
-- Note the URI bucket notation and the use_path_style parameter setting; the same applies to HTTP.
-- Since "use_path_style" = "true" is set, Path Style is used to access S3.
select * from s3(
  "uri" = "https://endpoint/bucket/file/student.csv",
  "s3.access_key" = "ak",
  "s3.secret_key" = "sk",
  "s3.endpoint" = "endpoint",
  "s3.region" = "region",
  "format" = "csv",
  "use_path_style" = "true");

-- Since "use_path_style" = "false" is set, Virtual-hosted Style is used to access S3.
select * from s3(
  "uri" = "https://bucket.endpoint/bucket/file/student.csv",
  "s3.access_key" = "ak",
  "s3.secret_key" = "sk",
  "s3.endpoint" = "endpoint",
  "s3.region" = "region",
  "format" = "csv",
  "use_path_style" = "false");

-- Alibaba Cloud OSS and Tencent Cloud COS use Virtual-hosted Style to access S3.
-- OSS
select * from s3(
  "uri" = "http://example-bucket.oss-cn-beijing.aliyuncs.com/your-folder/file.parquet",
  "s3.access_key" = "ak",
  "s3.secret_key" = "sk",
  "s3.endpoint" = "oss-cn-beijing.aliyuncs.com",
  "s3.region" = "oss-cn-beijing",
  "format" = "parquet",
  "use_path_style" = "false");

-- COS
select * from s3(
  "uri" = "https://example-bucket.cos.ap-hongkong.myqcloud.com/your-folder/file.parquet",
  "s3.access_key" = "ak",
  "s3.secret_key" = "sk",
  "s3.endpoint" = "cos.ap-hongkong.myqcloud.com",
  "s3.region" = "ap-hongkong",
  "format" = "parquet",
  "use_path_style" = "false");

-- MinIO
select * from s3(
  "uri" = "s3://bucket/file.csv",
  "s3.endpoint" = "http://172.21.0.101:9000",
  "s3.access_key" = "ak",
  "s3.secret_key" = "sk",
  "s3.region" = "us-east-1",
  "format" = "csv");

-- Baidu Cloud BOS uses the S3-compatible Virtual-hosted Style to access S3.
-- BOS
select * from s3(
  "uri" = "https://example-bucket.s3.bj.bcebos.com/your-folder/file.parquet",
  "s3.access_key" = "ak",
  "s3.secret_key" = "sk",
  "s3.region" = "bj",
  "s3.endpoint" = "http://bj.bcebos.com",
  "format" = "parquet",
  "use_path_style" = "false");
```
s3:// usage example:
```sql
-- Note the URI bucket notation; there is no need to set the use_path_style parameter.
-- Virtual-hosted Style is used to access S3.
select * from s3(
  "uri" = "s3://bucket/file/student.csv",
  "s3.endpoint" = "endpoint",
  "s3.region" = "region",
  "s3.access_key" = "ak",
  "s3.secret_key" = "sk",
  "format" = "csv");
```
Other supported URI style examples:
```sql
-- Virtual Host AWS Client (Hadoop S3) Mixed Style. Use by setting
-- use_path_style = false and force_parsing_by_standard_uri = true.
select * from s3(
  "uri" = "s3://my-bucket.s3.us-west-1.amazonaws.com/resources/doc.txt?versionId=abc123&partNumber=77&partNumber=88",
  "s3.access_key" = "ak",
  "s3.secret_key" = "sk",
  "s3.endpoint" = "endpoint",
  "s3.region" = "region",
  "format" = "csv",
  "use_path_style" = "false",
  "force_parsing_by_standard_uri" = "true");

-- Path AWS Client (Hadoop S3) Mixed Style. Use by setting
-- use_path_style = true and force_parsing_by_standard_uri = true.
select * from s3(
  "uri" = "s3://s3.us-west-1.amazonaws.com/my-bucket/resources/doc.txt?versionId=abc123&partNumber=77&partNumber=88",
  "s3.access_key" = "ak",
  "s3.secret_key" = "sk",
  "s3.endpoint" = "endpoint",
  "s3.region" = "region",
  "format" = "csv",
  "use_path_style" = "true",
  "force_parsing_by_standard_uri" = "true");
```
CSV Format
Since the S3 table-valued function doesn't know the Table Schema in advance, it reads the file once first to parse the schema out.

csv format: the S3 table-valued function reads the file on S3 and processes it as CSV, using the first line of the file to determine the Table Schema. The number of columns n in the first line becomes the number of columns in the Table Schema; the columns are automatically named c1, c2, ..., cn, and all column types are set to String. Example:
student1.csv file content:
```
1,ftw,12
2,zs,18
3,ww,20
```
Using S3 TVF:
```sql
select * from s3(
  "uri" = "http://127.0.0.1:9312/test2/student1.csv",
  "s3.access_key" = "minioadmin",
  "s3.secret_key" = "minioadmin",
  "s3.endpoint" = "http://127.0.0.1:9312",
  "s3.region" = "us-east-1",
  "format" = "csv",
  "use_path_style" = "true")
order by c1;
```

```text
+------+------+------+
| c1   | c2   | c3   |
+------+------+------+
| 1    | ftw  | 12   |
| 2    | zs   | 18   |
| 3    | ww   | 20   |
+------+------+------+
```
You can use `desc function s3()` to view the Table Schema:

```sql
desc function s3(
  "uri" = "http://127.0.0.1:9312/test2/student1.csv",
  "s3.access_key" = "minioadmin",
  "s3.secret_key" = "minioadmin",
  "s3.endpoint" = "http://127.0.0.1:9312",
  "s3.region" = "us-east-1",
  "format" = "csv",
  "use_path_style" = "true");
```

```text
+-------+------+------+-------+---------+-------+
| Field | Type | Null | Key   | Default | Extra |
+-------+------+------+-------+---------+-------+
| c1    | TEXT | Yes  | false | NULL    | NONE  |
| c2    | TEXT | Yes  | false | NULL    | NONE  |
| c3    | TEXT | Yes  | false | NULL    | NONE  |
+-------+------+------+-------+---------+-------+
```
csv_with_names Format
csv_with_names format: Parses the first line of the file as the number and names of columns in Table Schema, with all column types set to String. Example:
student_with_names.csv file content:
```
id,name,age
1,ftw,12
2,zs,18
3,ww,20
```
Using S3 TVF:
```sql
select * from s3(
  "uri" = "http://127.0.0.1:9312/test2/student_with_names.csv",
  "s3.access_key" = "minioadmin",
  "s3.secret_key" = "minioadmin",
  "s3.endpoint" = "http://127.0.0.1:9312",
  "s3.region" = "us-east-1",
  "format" = "csv_with_names",
  "use_path_style" = "true")
order by id;
```

```text
+------+------+------+
| id   | name | age  |
+------+------+------+
| 1    | ftw  | 12   |
| 2    | zs   | 18   |
| 3    | ww   | 20   |
+------+------+------+
```
Similarly, use desc function S3() to view Table Schema
```sql
desc function s3(
  "uri" = "http://127.0.0.1:9312/test2/student_with_names.csv",
  "s3.access_key" = "minioadmin",
  "s3.secret_key" = "minioadmin",
  "s3.endpoint" = "http://127.0.0.1:9312",
  "s3.region" = "us-east-1",
  "format" = "csv_with_names",
  "use_path_style" = "true");
```

```text
+-------+------+------+-------+---------+-------+
| Field | Type | Null | Key   | Default | Extra |
+-------+------+------+-------+---------+-------+
| id    | TEXT | Yes  | false | NULL    | NONE  |
| name  | TEXT | Yes  | false | NULL    | NONE  |
| age   | TEXT | Yes  | false | NULL    | NONE  |
+-------+------+------+-------+---------+-------+
```
csv_with_names_and_types Format
csv_with_names_and_types format: Currently does not support parsing column types from CSV files. When using this format, S3 TVF will parse the first line as the number and names of columns in Table Schema, with all column types set to String, and will ignore the second line of the file.
student_with_names_and_types.csv file content:
```
id,name,age
INT,STRING,INT
1,ftw,12
2,zs,18
3,ww,20
```
Using S3 TVF:
```sql
select * from s3(
  "uri" = "http://127.0.0.1:9312/test2/student_with_names_and_types.csv",
  "s3.access_key" = "minioadmin",
  "s3.secret_key" = "minioadmin",
  "s3.endpoint" = "http://127.0.0.1:9312",
  "s3.region" = "us-east-1",
  "format" = "csv_with_names_and_types",
  "use_path_style" = "true")
order by id;
```

```text
+------+------+------+
| id   | name | age  |
+------+------+------+
| 1    | ftw  | 12   |
| 2    | zs   | 18   |
| 3    | ww   | 20   |
+------+------+------+
```
Similarly, use desc function S3() to view Table Schema:
```sql
desc function s3(
  "uri" = "http://127.0.0.1:9312/test2/student_with_names_and_types.csv",
  "s3.access_key" = "minioadmin",
  "s3.secret_key" = "minioadmin",
  "s3.endpoint" = "http://127.0.0.1:9312",
  "s3.region" = "us-east-1",
  "format" = "csv_with_names_and_types",
  "use_path_style" = "true");
```

```text
+-------+------+------+-------+---------+-------+
| Field | Type | Null | Key   | Default | Extra |
+-------+------+------+-------+---------+-------+
| id    | TEXT | Yes  | false | NULL    | NONE  |
| name  | TEXT | Yes  | false | NULL    | NONE  |
| age   | TEXT | Yes  | false | NULL    | NONE  |
+-------+------+------+-------+---------+-------+
```
JSON Format
json format: the JSON format involves many optional parameters; for the meaning of each, refer to: Json Load. When querying JSON files, the S3 TVF locates a JSON object based on the json_root and jsonpaths parameters, uses the keys in that object as the column names of the Table Schema, and sets all column types to String. Example:
data.json file content:
```
[{"id":1, "name":"ftw", "age":18}]
[{"id":2, "name":"xxx", "age":17}]
[{"id":3, "name":"yyy", "age":19}]
```
Using S3 TVF to query:
```sql
select * from s3(
  "uri" = "http://127.0.0.1:9312/test2/data.json",
  "s3.access_key" = "minioadmin",
  "s3.secret_key" = "minioadmin",
  "s3.endpoint" = "http://127.0.0.1:9312",
  "s3.region" = "us-east-1",
  "format" = "json",
  "strip_outer_array" = "true",
  "read_json_by_line" = "true",
  "use_path_style" = "true");
```

```text
+------+------+------+
| id   | name | age  |
+------+------+------+
| 1    | ftw  | 18   |
| 2    | xxx  | 17   |
| 3    | yyy  | 19   |
+------+------+------+
```

```sql
select * from s3(
  "uri" = "http://127.0.0.1:9312/test2/data.json",
  "s3.access_key" = "minioadmin",
  "s3.secret_key" = "minioadmin",
  "s3.endpoint" = "http://127.0.0.1:9312",
  "s3.region" = "us-east-1",
  "format" = "json",
  "strip_outer_array" = "true",
  "jsonpaths" = "[\"$.id\", \"$.age\"]",
  "use_path_style" = "true");
```

```text
+------+------+
| id   | age  |
+------+------+
| 1    | 18   |
| 2    | 17   |
| 3    | 19   |
+------+------+
```
Parquet Format
parquet format: S3 TVF supports parsing column names and types from Parquet files for Table Schema. Example:
```sql
select * from s3(
  "uri" = "http://127.0.0.1:9312/test2/test.snappy.parquet",
  "s3.access_key" = "minioadmin",
  "s3.secret_key" = "minioadmin",
  "s3.endpoint" = "http://127.0.0.1:9312",
  "s3.region" = "us-east-1",
  "format" = "parquet",
  "use_path_style" = "true")
limit 5;
```

```text
+-----------+------------------------------------------+----------------+----------+-------------------------+--------+-------------+---------------+---------------------+
| p_partkey | p_name                                   | p_mfgr         | p_brand  | p_type                  | p_size | p_container | p_retailprice | p_comment           |
+-----------+------------------------------------------+----------------+----------+-------------------------+--------+-------------+---------------+---------------------+
| 1         | goldenrod lavender spring chocolate lace | Manufacturer#1 | Brand#13 | PROMO BURNISHED COPPER  | 7      | JUMBO PKG   | 901           | ly. slyly ironi     |
| 2         | blush thistle blue yellow saddle         | Manufacturer#1 | Brand#13 | LARGE BRUSHED BRASS     | 1      | LG CASE     | 902           | lar accounts amo    |
| 3         | spring green yellow purple cornsilk      | Manufacturer#4 | Brand#42 | STANDARD POLISHED BRASS | 21     | WRAP CASE   | 903           | egular deposits hag |
| 4         | cornflower chocolate smoke green pink    | Manufacturer#3 | Brand#34 | SMALL PLATED BRASS      | 14     | MED DRUM    | 904           | p furiously r       |
| 5         | forest brown coral puff cream            | Manufacturer#3 | Brand#32 | STANDARD POLISHED TIN   | 15     | SM PKG      | 905           | wake carefully      |
+-----------+------------------------------------------+----------------+----------+-------------------------+--------+-------------+---------------+---------------------+
```
```sql
desc function s3(
  "uri" = "http://127.0.0.1:9312/test2/test.snappy.parquet",
  "s3.access_key" = "minioadmin",
  "s3.secret_key" = "minioadmin",
  "s3.endpoint" = "http://127.0.0.1:9312",
  "s3.region" = "us-east-1",
  "format" = "parquet",
  "use_path_style" = "true");
```

```text
+---------------+--------------+------+-------+---------+-------+
| Field         | Type         | Null | Key   | Default | Extra |
+---------------+--------------+------+-------+---------+-------+
| p_partkey     | INT          | Yes  | false | NULL    | NONE  |
| p_name        | TEXT         | Yes  | false | NULL    | NONE  |
| p_mfgr        | TEXT         | Yes  | false | NULL    | NONE  |
| p_brand       | TEXT         | Yes  | false | NULL    | NONE  |
| p_type        | TEXT         | Yes  | false | NULL    | NONE  |
| p_size        | INT          | Yes  | false | NULL    | NONE  |
| p_container   | TEXT         | Yes  | false | NULL    | NONE  |
| p_retailprice | DECIMAL(9,0) | Yes  | false | NULL    | NONE  |
| p_comment     | TEXT         | Yes  | false | NULL    | NONE  |
+---------------+--------------+------+-------+---------+-------+
```
ORC Format
orc format: Usage is the same as parquet format, just set the format parameter to orc.
```sql
select * from s3(
  "uri" = "http://127.0.0.1:9312/test2/test.snappy.orc",
  "s3.access_key" = "minioadmin",
  "s3.secret_key" = "minioadmin",
  "s3.endpoint" = "http://127.0.0.1:9312",
  "s3.region" = "us-east-1",
  "format" = "orc",
  "use_path_style" = "true")
limit 5;
```

```text
+-----------+------------------------------------------+----------------+----------+-------------------------+--------+-------------+---------------+---------------------+
| p_partkey | p_name                                   | p_mfgr         | p_brand  | p_type                  | p_size | p_container | p_retailprice | p_comment           |
+-----------+------------------------------------------+----------------+----------+-------------------------+--------+-------------+---------------+---------------------+
| 1         | goldenrod lavender spring chocolate lace | Manufacturer#1 | Brand#13 | PROMO BURNISHED COPPER  | 7      | JUMBO PKG   | 901           | ly. slyly ironi     |
| 2         | blush thistle blue yellow saddle         | Manufacturer#1 | Brand#13 | LARGE BRUSHED BRASS     | 1      | LG CASE     | 902           | lar accounts amo    |
| 3         | spring green yellow purple cornsilk      | Manufacturer#4 | Brand#42 | STANDARD POLISHED BRASS | 21     | WRAP CASE   | 903           | egular deposits hag |
| 4         | cornflower chocolate smoke green pink    | Manufacturer#3 | Brand#34 | SMALL PLATED BRASS      | 14     | MED DRUM    | 904           | p furiously r       |
| 5         | forest brown coral puff cream            | Manufacturer#3 | Brand#32 | STANDARD POLISHED TIN   | 15     | SM PKG      | 905           | wake carefully      |
+-----------+------------------------------------------+----------------+----------+-------------------------+--------+-------------+---------------+---------------------+
```
URI with Wildcards
The URI can use wildcards to read multiple files. Note: when using wildcards, ensure all matched files share the same format (in particular, csv, csv_with_names, and csv_with_names_and_types count as different formats). The S3 TVF uses the first file to parse the Table Schema.
With the following two CSV files:
```
// file1.csv
1,aaa,18
2,qqq,20
3,qwe,19

// file2.csv
5,cyx,19
6,ftw,21
```
You can use wildcards in URI to query:
```sql
select * from s3(
  "uri" = "http://127.0.0.1:9312/test2/file*.csv",
  "s3.access_key" = "minioadmin",
  "s3.secret_key" = "minioadmin",
  "s3.endpoint" = "http://127.0.0.1:9312",
  "s3.region" = "us-east-1",
  "format" = "csv",
  "use_path_style" = "true");
```

```text
+------+------+------+
| c1   | c2   | c3   |
+------+------+------+
| 1    | aaa  | 18   |
| 2    | qqq  | 20   |
| 3    | qwe  | 19   |
| 5    | cyx  | 19   |
| 6    | ftw  | 21   |
+------+------+------+
```
Using S3 TVF with insert into and cast
```sql
-- Create a Doris internal table
CREATE TABLE IF NOT EXISTS ${testTable}
(
  id int,
  name varchar(50),
  age int
)
COMMENT "my first table"
DISTRIBUTED BY HASH(id) BUCKETS 32
PROPERTIES("replication_num" = "1");

-- Insert data using the S3 TVF
insert into ${testTable} (id, name, age)
select cast(id as INT) as id, name, cast(age as INT) as age
from s3(
  "uri" = "${uri}",
  "s3.access_key" = "${ak}",
  "s3.secret_key" = "${sk}",
  "s3.endpoint" = "http://127.0.0.1:9312",
  "s3.region" = "us-east-1",
  "format" = "${format}",
  "strip_outer_array" = "true",
  "read_json_by_line" = "true",
  "use_path_style" = "true");
```