blob: 07a2ceac3bed0e608406d93ed064833c06519c0e [file] [view]
---
{
"title": "S3 | Table Valued Functions",
"language": "en",
"description": "The S3 table-valued function (TVF) allows users to read and access files stored on S3-compatible object storage as if accessing relational table-formatted data. Currently supports `csv/csv_with_names/csv_with_names_and_types/json/parquet/orc` file formats.",
"sidebar_label": "S3"
}
---
# S3
## Description
The S3 table-valued function (TVF) allows users to read and access files stored on S3-compatible object storage as if accessing relational table-formatted data. Currently supports `csv/csv_with_names/csv_with_names_and_types/json/parquet/orc` file formats.
## Syntax
```sql
S3(
"uri" = "<uri>",
"s3.access_key" = "<s3_access_key>",
"s3.secret_key" = "<s3_secret_key>",
"s3.region" = "<s3_region>",
"s3.endpoint" = "<s3_endpoint>",
"format" = "<format>"
[, "<optional_property_key>" = "<optional_property_value>" [, ...] ]
)
```
## Required Parameters
| Parameter | Description |
|-----------------|---------------------------------------------------------------------------------------------------------------------------------------|
| `uri` | URI for accessing S3. The function will use either Path Style or Virtual-hosted Style based on the `use_path_style` parameter |
| `s3.access_key` | Access key for S3 |
| `s3.secret_key` | Secret key for S3 |
| `s3.region` | S3 region |
| `s3.endpoint` | S3 endpoint address |
| `format` | File format, supports `csv/csv_with_names/csv_with_names_and_types/json/parquet/orc` |
## Optional Parameters
| Parameter | Default | Description | Notes |
|---------------------------------|---------|-----------------------------------------------------------------------------------------------------------------------------------------|------------------------------------------------------------------------------------|
| `s3.session_token` | - | S3 session token | |
| `use_path_style` | `false` | S3 SDK uses Virtual-hosted Style by default. Some object storage systems may not support Virtual-hosted Style. Set this parameter to force Path Style access. For example, MinIO only allows Path Style by default, so `use_path_style=true` should be set when accessing MinIO | |
| `force_parsing_by_standard_uri` | `false` | Forces parsing of non-standard URIs as standard URIs | |
| `column_separator` | `\t` | Column separator | |
| `line_delimiter` | `\n` | Line delimiter | |
| `compress_type` | `UNKNOWN` | Compression type. Currently supports `UNKNOWN`/`PLAIN`/`GZ`/`LZO`/`BZ2`/`LZ4FRAME`/`DEFLATE`/`SNAPPYBLOCK`, will auto-detect based on `uri` suffix | |
| `read_json_by_line` | `"true"` | For JSON format import | [JSON Load](../../../data-operate/import/file-format/json) |
| `strip_outer_array` | `"false"` | For JSON format import | [JSON Load](../../../data-operate/import/file-format/json) |
| `json_root` | Empty | For JSON format import | [JSON Load](../../../data-operate/import/file-format/json) |
| `jsonpaths` | Empty | For JSON format import | [JSON Load](../../../data-operate/import/file-format/json) |
| `num_as_string` | `false` | For JSON format import | [JSON Load](../../../data-operate/import/file-format/json) |
| `fuzzy_parse` | `false` | For JSON format import | |
| `trim_double_quotes` | `false` | For CSV format import, removes the outermost double quotes from each field | |
| `skip_lines` | `0` | Skips the first N lines of CSV file. This parameter is ineffective when using `csv_with_names` or `csv_with_names_and_types` format | |
| `path_partition_keys` | - | Specifies partition column names in file path. For example, for path `/path/to/city=beijing/date="2023-07-09"`, set `path_partition_keys="city,date"` to automatically read column names and values from path | |
| `resource` | - | Specifies Resource name. S3 TVF can use existing S3 Resource to access S3 directly. For creating S3 Resource, refer to [CREATE-RESOURCE](../../sql-statements/cluster-management/compute-management/CREATE-RESOURCE). Supported from version 2.1.4. | |
| `enable_mapping_varbinary` | `false` | Maps BYTE_ARRAY type to STRING when reading PARQUET/ORC. When enabled, maps to VARBINARY type instead. | Supported from version 4.0.3 |
## Notes
1. For AWS S3, standard URI styles include:
- AWS Client Style (Hadoop S3 Style)
```
s3://my-bucket/path/to/file?versionId=abc123&partNumber=77&partNumber=88
```
- Virtual Host Style
```
https://my-bucket.s3.us-west-1.amazonaws.com/resources/doc.txt?versionId=abc123&partNumber=77&partNumber=88
```
- Path Style
```
https://s3.us-west-1.amazonaws.com/my-bucket/resources/doc.txt?versionId=abc123&partNumber=77&partNumber=88
```
Besides these three standard URI styles, other URI styles are also supported (though less common):
- Virtual Host AWS Client (Hadoop S3) Mixed Style
```
s3://my-bucket.s3.us-west-1.amazonaws.com/resources/doc.txt?versionId=abc123&partNumber=77&partNumber=88
```
- Path AWS Client (Hadoop S3) Mixed Style
```
s3://s3.us-west-1.amazonaws.com/my-bucket/resources/doc.txt?versionId=abc123&partNumber=77&partNumber=88
```
For detailed usage examples, see the examples below.
2. To directly query TVF or create a View based on this TVF, USAGE permission on the Resource is required. To query a View created based on TVF, only SELECT permission on the View is needed.
## Examples
- Read and access CSV format files on S3-compatible object storage
```sql
select * from s3("uri" = "http://127.0.0.1:9312/test2/student1.csv",
"s3.access_key"= "minioadmin",
"s3.secret_key" = "minioadmin",
"s3.endpoint" = "http://127.0.0.1:9312",
"s3.region" = "us-east-1",
"format" = "csv",
"use_path_style" = "true") order by c1;
```
- Can be used with `desc function`
```sql
desc function s3("uri" = "http://127.0.0.1:9312/test2/student1.csv",
"s3.access_key"= "minioadmin",
"s3.secret_key" = "minioadmin",
"s3.endpoint" = "http://127.0.0.1:9312",
"s3.region" = "us-east-1",
"format" = "csv",
"use_path_style" = "true");
```
- **Different URL Schema Usage**
`http://`, `https://` usage examples:
```sql
-- Note URI Bucket notation and use_path_style parameter setting, same for HTTP.
-- Since "use_path_style"="true" is set, Path Style will be used to access S3.
select * from s3(
"uri" = "https://endpoint/bucket/file/student.csv",
"s3.access_key"= "ak",
"s3.secret_key" = "sk",
"s3.endpoint" = "endpoint",
"s3.region" = "region",
"format" = "csv",
"use_path_style"="true");
-- Note URI Bucket notation and use_path_style parameter setting, same for HTTP.
-- Since "use_path_style"="false" is set, Virtual-hosted Style will be used to access S3.
select * from s3(
"uri" = "https://bucket.endpoint/bucket/file/student.csv",
"s3.access_key"= "ak",
"s3.secret_key" = "sk",
"s3.endpoint" = "endpoint",
"s3.region" = "region",
"format" = "csv",
"use_path_style"="false");
-- Alibaba Cloud OSS and Tencent Cloud COS use Virtual-hosted Style to access S3.
-- OSS
select * from s3(
"uri" = "http://example-bucket.oss-cn-beijing.aliyuncs.com/your-folder/file.parquet",
"s3.access_key"= "ak",
"s3.secret_key" = "sk",
"s3.endpoint" = "oss-cn-beijing.aliyuncs.com",
"s3.region" = "oss-cn-beijing",
"format" = "parquet",
"use_path_style" = "false");
-- COS
select * from s3(
"uri" = "https://example-bucket.cos.ap-hongkong.myqcloud.com/your-folder/file.parquet",
"s3.access_key"= "ak",
"s3.secret_key" = "sk",
"s3.endpoint" = "cos.ap-hongkong.myqcloud.com",
"s3.region" = "ap-hongkong",
"format" = "parquet",
"use_path_style" = "false");
-- MinIO
select * from s3(
"uri" = "s3://bucket/file.csv",
"s3.endpoint" = "http://172.21.0.101:9000",
"s3.access_key"= "ak",
"s3.secret_key" = "sk",
"s3.region" = "us-east-1",
"format" = "csv"
);
-- Baidu Cloud BOS uses S3-compatible Virtual-hosted Style to access S3.
-- BOS
select * from s3(
"uri" = "https://example-bucket.s3.bj.bcebos.com/your-folder/file.parquet",
"s3.access_key"= "ak",
"s3.secret_key" = "sk",
"s3.region" = "bj",
"s3.endpoint" = "http://bj.bcebos.com",
"format" = "parquet",
"use_path_style" = "false");
```
- `s3://` usage example:
```sql
-- Note URI Bucket notation, no need to set use_path_style parameter.
-- Virtual-hosted Style will be used to access S3.
select * from s3(
"uri" = "s3://bucket/file/student.csv",
"s3.endpoint"= "endpont",
"s3.region"= "region",
"s3.access_key"= "ak",
"s3.secret_key" = "sk",
"format" = "csv");
```
- Other supported URI style examples:
```sql
-- Virtual Host AWS Client (Hadoop S3) Mixed Style. Use by setting use_path_style=false and force_parsing_by_standard_uri=true.
select * from s3(
"URI" = "s3://my-bucket.s3.us-west-1.amazonaws.com/resources/doc.txt?versionId=abc123&partNumber=77&partNumber=88",
"s3.access_key"= "ak",
"s3.secret_key" = "sk",
"s3.endpoint"= "endpont",
"s3.region"= "region",
"format" = "csv",
"use_path_style"="false",
"force_parsing_by_standard_uri"="true");
-- Path AWS Client (Hadoop S3) Mixed Style. Use by setting use_path_style=true and force_parsing_by_standard_uri=true.
select * from s3(
"URI" = "s3://s3.us-west-1.amazonaws.com/my-bucket/resources/doc.txt?versionId=abc123&partNumber=77&partNumber=88",
"s3.access_key"= "ak",
"s3.secret_key" = "sk",
"s3.endpoint"= "endpont",
"s3.region"= "region",
"format" = "csv",
"use_path_style"="true",
"force_parsing_by_standard_uri"="true");
```
- **CSV Format**
Since S3 table-valued-function doesn't know the Table Schema in advance, it will read the file once to parse the Table Schema.
`csv` format: S3 table-valued-function reads files on S3 and processes them as CSV files, using the first line to parse Table Schema. The number of columns `n` in the first line will be used as the number of columns in Table Schema, and column names will be automatically named as `c1, c2, ..., cn`, with all column types set to `String`. Example:
`student1.csv` file content:
```
1,ftw,12
2,zs,18
3,ww,20
```
Using S3 TVF:
```sql
select * from s3("uri" = "http://127.0.0.1:9312/test2/student1.csv",
"s3.access_key"= "minioadmin",
"s3.secret_key" = "minioadmin",
"s3.endpoint" = "http://127.0.0.1:9312",
"s3.region" = "us-east-1",
"format" = "csv",
"use_path_style" = "true") order by c1;
+------+------+------+
| c1 | c2 | c3 |
+------+------+------+
| 1 | ftw | 12 |
| 2 | zs | 18 |
| 3 | ww | 20 |
+------+------+------+
```
Can use `desc function S3()` to view Table Schema:
```sql
Desc function s3("uri" = "http://127.0.0.1:9312/test2/student1.csv",
"s3.access_key"= "minioadmin",
"s3.secret_key" = "minioadmin",
"s3.endpoint" = "http://127.0.0.1:9312",
"s3.region" = "us-east-1",
"format" = "csv",
"use_path_style" = "true");
+-------+------+------+-------+---------+-------+
| Field | Type | Null | Key | Default | Extra |
+-------+------+------+-------+---------+-------+
| c1 | TEXT | Yes | false | NULL | NONE |
| c2 | TEXT | Yes | false | NULL | NONE |
| c3 | TEXT | Yes | false | NULL | NONE |
+-------+------+------+-------+---------+-------+
```
- **`csv_with_names` Format**
`csv_with_names` format: Parses the first line of the file as the number and names of columns in Table Schema, with all column types set to `String`. Example:
`student_with_names.csv` file content:
```
id,name,age
1,ftw,12
2,zs,18
3,ww,20
```
Using S3 TVF:
```sql
select * from s3("uri" = "http://127.0.0.1:9312/test2/student_with_names.csv",
"s3.access_key"= "minioadmin",
"s3.secret_key" = "minioadmin",
"s3.endpoint" = "http://127.0.0.1:9312",
"s3.region" = "us-east-1",
"format" = "csv_with_names",
"use_path_style" = "true") order by id;
+------+------+------+
| id | name | age |
+------+------+------+
| 1 | ftw | 12 |
| 2 | zs | 18 |
| 3 | ww | 20 |
+------+------+------+
```
- Similarly, use `desc function S3()` to view Table Schema
```sql
desc function s3("uri" = "http://127.0.0.1:9312/test2/student_with_names.csv",
"s3.access_key"= "minioadmin",
"s3.secret_key" = "minioadmin",
"s3.endpoint" = "http://127.0.0.1:9312",
"s3.region" = "us-east-1",
"format" = "csv_with_names",
"use_path_style" = "true");
+-------+------+------+-------+---------+-------+
| Field | Type | Null | Key | Default | Extra |
+-------+------+------+-------+---------+-------+
| id | TEXT | Yes | false | NULL | NONE |
| name | TEXT | Yes | false | NULL | NONE |
| age | TEXT | Yes | false | NULL | NONE |
+-------+------+------+-------+---------+-------+
```
- **`csv_with_names_and_types` Format**
`csv_with_names_and_types` format: Currently does not support parsing column types from CSV files. When using this format, S3 TVF will parse the first line as the number and names of columns in Table Schema, with all column types set to String, and will ignore the second line of the file.
`student_with_names_and_types.csv` file content:
```
id,name,age
INT,STRING,INT
1,ftw,12
2,zs,18
3,ww,20
```
Using S3 TVF:
```sql
select * from s3("uri" = "http://127.0.0.1:9312/test2/student_with_names_and_types.csv",
"s3.access_key"= "minioadmin",
"s3.secret_key" = "minioadmin",
"s3.endpoint" = "http://127.0.0.1:9312",
"s3.region" = "us-east-1",
"format" = "csv_with_names_and_types",
"use_path_style" = "true") order by id;
+------+------+------+
| id | name | age |
+------+------+------+
| 1 | ftw | 12 |
| 2 | zs | 18 |
| 3 | ww | 20 |
+------+------+------+
```
Similarly, use `desc function S3()` to view Table Schema:
```sql
Desc function s3("uri" = "http://127.0.0.1:9312/test2/student_with_names_and_types.csv",
"s3.access_key"= "minioadmin",
"s3.secret_key" = "minioadmin",
"s3.endpoint" = "http://127.0.0.1:9312",
"s3.region" = "us-east-1",
"format" = "csv_with_names_and_types",
"use_path_style" = "true");
+-------+------+------+-------+---------+-------+
| Field | Type | Null | Key | Default | Extra |
+-------+------+------+-------+---------+-------+
| id | TEXT | Yes | false | NULL | NONE |
| name | TEXT | Yes | false | NULL | NONE |
| age | TEXT | Yes | false | NULL | NONE |
+-------+------+------+-------+---------+-------+
```
- **JSON Format**
`json` format: JSON format involves many optional parameters. For the meaning of each parameter, refer to: [Json Load](../../../data-operate/import/file-format/json). When querying JSON format files with S3 TVF, it locates a JSON object based on `json_root` and `jsonpaths` parameters, uses the `key` in the object as column names in Table Schema, with all column types set to String. Example:
`data.json` file content:
```
[{"id":1, "name":"ftw", "age":18}]
[{"id":2, "name":"xxx", "age":17}]
[{"id":3, "name":"yyy", "age":19}]
```
Using S3 TVF to query:
```sql
select * from s3(
"uri" = "http://127.0.0.1:9312/test2/data.json",
"s3.access_key"= "minioadmin",
"s3.secret_key" = "minioadmin",
"s3.endpoint" = "http://127.0.0.1:9312",
"s3.region" = "us-east-1",
"format" = "json",
"strip_outer_array" = "true",
"read_json_by_line" = "true",
"use_path_style"="true");
+------+------+------+
| id | name | age |
+------+------+------+
| 1 | ftw | 18 |
| 2 | xxx | 17 |
| 3 | yyy | 19 |
+------+------+------+
select * from s3(
"uri" = "http://127.0.0.1:9312/test2/data.json",
"s3.access_key"= "minioadmin",
"s3.secret_key" = "minioadmin",
"s3.endpoint" = "http://127.0.0.1:9312",
"s3.region" = "us-east-1",
"format" = "json",
"strip_outer_array" = "true",
"jsonpaths" = "[\"$.id\", \"$.age\"]",
"use_path_style"="true");
+------+------+
| id | age |
+------+------+
| 1 | 18 |
| 2 | 17 |
| 3 | 19 |
+------+------+
```
- **Parquet Format**
`parquet` format: S3 TVF supports parsing column names and types from Parquet files for Table Schema. Example:
```sql
select * from s3(
"uri" = "http://127.0.0.1:9312/test2/test.snappy.parquet",
"s3.access_key"= "minioadmin",
"s3.secret_key" = "minioadmin",
"s3.endpoint" = "http://127.0.0.1:9312",
"s3.region" = "us-east-1",
"format" = "parquet",
"use_path_style"="true") limit 5;
+-----------+------------------------------------------+----------------+----------+-------------------------+--------+-------------+---------------+---------------------+
| p_partkey | p_name | p_mfgr | p_brand | p_type | p_size | p_container | p_retailprice | p_comment |
+-----------+------------------------------------------+----------------+----------+-------------------------+--------+-------------+---------------+---------------------+
| 1 | goldenrod lavender spring chocolate lace | Manufacturer#1 | Brand#13 | PROMO BURNISHED COPPER | 7 | JUMBO PKG | 901 | ly. slyly ironi |
| 2 | blush thistle blue yellow saddle | Manufacturer#1 | Brand#13 | LARGE BRUSHED BRASS | 1 | LG CASE | 902 | lar accounts amo |
| 3 | spring green yellow purple cornsilk | Manufacturer#4 | Brand#42 | STANDARD POLISHED BRASS | 21 | WRAP CASE | 903 | egular deposits hag |
| 4 | cornflower chocolate smoke green pink | Manufacturer#3 | Brand#34 | SMALL PLATED BRASS | 14 | MED DRUM | 904 | p furiously r |
| 5 | forest brown coral puff cream | Manufacturer#3 | Brand#32 | STANDARD POLISHED TIN | 15 | SM PKG | 905 | wake carefully |
+-----------+------------------------------------------+----------------+----------+-------------------------+--------+-------------+---------------+---------------------+
```
```sql
MySQL [(none)]> desc function s3(
"uri" = "http://127.0.0.1:9312/test2/test.snappy.parquet",
"s3.access_key"= "minioadmin",
"s3.secret_key" = "minioadmin",
"s3.endpoint" = "http://127.0.0.1:9312",
"s3.region" = "us-east-1",
"format" = "parquet",
"use_path_style"="true");
+---------------+--------------+------+-------+---------+-------+
| Field | Type | Null | Key | Default | Extra |
+---------------+--------------+------+-------+---------+-------+
| p_partkey | INT | Yes | false | NULL | NONE |
| p_name | TEXT | Yes | false | NULL | NONE |
| p_mfgr | TEXT | Yes | false | NULL | NONE |
| p_brand | TEXT | Yes | false | NULL | NONE |
| p_type | TEXT | Yes | false | NULL | NONE |
| p_size | INT | Yes | false | NULL | NONE |
| p_container | TEXT | Yes | false | NULL | NONE |
| p_retailprice | DECIMAL(9,0) | Yes | false | NULL | NONE |
| p_comment | TEXT | Yes | false | NULL | NONE |
+---------------+--------------+------+-------+---------+-------+
```
- **ORC Format**
`orc` format: Usage is the same as `parquet` format, just set the `format` parameter to `orc`.
```sql
select * from s3(
"uri" = "http://127.0.0.1:9312/test2/test.snappy.orc",
"s3.access_key"= "minioadmin",
"s3.secret_key" = "minioadmin",
"s3.endpoint" = "http://127.0.0.1:9312",
"s3.region" = "us-east-1",
"format" = "orc",
"use_path_style"="true") limit 5;
+-----------+------------------------------------------+----------------+----------+-------------------------+--------+-------------+---------------+---------------------+
| p_partkey | p_name | p_mfgr | p_brand | p_type | p_size | p_container | p_retailprice | p_comment |
+-----------+------------------------------------------+----------------+----------+-------------------------+--------+-------------+---------------+---------------------+
| 1 | goldenrod lavender spring chocolate lace | Manufacturer#1 | Brand#13 | PROMO BURNISHED COPPER | 7 | JUMBO PKG | 901 | ly. slyly ironi |
| 2 | blush thistle blue yellow saddle | Manufacturer#1 | Brand#13 | LARGE BRUSHED BRASS | 1 | LG CASE | 902 | lar accounts amo |
| 3 | spring green yellow purple cornsilk | Manufacturer#4 | Brand#42 | STANDARD POLISHED BRASS | 21 | WRAP CASE | 903 | egular deposits hag |
| 4 | cornflower chocolate smoke green pink | Manufacturer#3 | Brand#34 | SMALL PLATED BRASS | 14 | MED DRUM | 904 | p furiously r |
| 5 | forest brown coral puff cream | Manufacturer#3 | Brand#32 | STANDARD POLISHED TIN | 15 | SM PKG | 905 | wake carefully |
+-----------+------------------------------------------+----------------+----------+-------------------------+--------+-------------+---------------+---------------------+
```
- **URI with Wildcards**
URI can use wildcards to read multiple files. Note: When using wildcards, ensure all files have the same format (especially `csv`, `csv_with_names`, `csv_with_names_and_types` are different formats). S3 TVF will use the first file to parse Table Schema.
With the following two CSV files:
```
// file1.csv
1,aaa,18
2,qqq,20
3,qwe,19
// file2.csv
5,cyx,19
6,ftw,21
```
You can use wildcards in URI to query:
```sql
select * from s3(
"uri" = "http://127.0.0.1:9312/test2/file*.csv",
"s3.access_key"= "minioadmin",
"s3.secret_key" = "minioadmin",
"s3.endpoint" = "http://127.0.0.1:9312",
"s3.region" = "us-east-1",
"format" = "csv",
"use_path_style"="true");
+------+------+------+
| c1 | c2 | c3 |
+------+------+------+
| 1 | aaa | 18 |
| 2 | qqq | 20 |
| 3 | qwe | 19 |
| 5 | cyx | 19 |
| 6 | ftw | 21 |
+------+------+------+
```
- **Using S3 TVF with `insert into` and `cast`**
```sql
-- Create Doris internal table
CREATE TABLE IF NOT EXISTS ${testTable}
(
id int,
name varchar(50),
age int
)
COMMENT "my first table"
DISTRIBUTED BY HASH(id) BUCKETS 32
PROPERTIES("replication_num" = "1");
-- Insert data using S3
insert into ${testTable} (id,name,age)
select cast (id as INT) as id, name, cast (age as INT) as age
from s3(
"uri" = "${uri}",
"s3.access_key"= "${ak}",
"s3.secret_key" = "${sk}",
"s3.endpoint" = "http://127.0.0.1:9312",
"s3.region" = "us-east-1",
"format" = "${format}",
"strip_outer_array" = "true",
"read_json_by_line" = "true",
"use_path_style" = "true");
```