import Tabs from '@theme/Tabs';
import TabItem from '@theme/TabItem';
# File
> File sink connector
## Description
Output data to a local or HDFS file.
:::tip
Supported engines and plugin names
* [x] Spark: File
* [x] Flink: File
:::
## Options
<Tabs
groupId="engine-type"
defaultValue="spark"
values={[
{label: 'Spark', value: 'spark'},
{label: 'Flink', value: 'flink'},
]}>
<TabItem value="spark">
| name | type | required | default value |
| ---------------- | ------ | -------- | -------------- |
| options | object | no | - |
| partition_by | array | no | - |
| path | string | yes | - |
| path_time_format | string | no | yyyyMMddHHmmss |
| save_mode | string | no | error |
| serializer | string | no | json |
| common-options | string | no | - |
### options [object]
Custom parameters
### partition_by [array]
Partition data based on selected fields
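
As a minimal sketch, assuming the incoming rows carry a field named `dt` (a hypothetical field name, as is the output directory), partitioning could be configured like this:

```bash
file {
    # Hypothetical output directory
    path = "hdfs:///tmp/seatunnel/output"
    serializer = "parquet"
    # `dt` is an assumed field name in the incoming data
    partition_by = ["dt"]
}
```
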
### path [string]
The file path is required. An HDFS path starts with `hdfs://`, and a local path starts with `file://`.
The path may contain the variables `${now}` and `${uuid}`, for example `hdfs:///test_${uuid}_${now}.txt`.
`${now}` represents the current time; its format is set with the `path_time_format` option.
### path_time_format [string]
When the `path` parameter contains `${now}` (for example `xxxx-${now}`), `path_time_format` specifies the time format used in the path. The default value is `yyyyMMddHHmmss`. Commonly used format symbols are listed below:
| Symbol | Description |
| ------ | ------------------ |
| y | Year |
| M | Month |
| d | Day of month |
| H | Hour in day (0-23) |
| m | Minute in hour |
| s | Second in minute |
See [Java SimpleDateFormat](https://docs.oracle.com/javase/tutorial/i18n/format/simpleDateFormat.html) for detailed time format syntax.
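
A minimal sketch of how `path` and `path_time_format` might be combined; the directory name is a hypothetical placeholder. With the pattern `yyyy.MM.dd`, `${now}` would expand to a date such as `2024.01.15`:

```bash
file {
    # Hypothetical local directory; ${now} is replaced at write time
    path = "file:///tmp/seatunnel/logs_${now}"
    # SimpleDateFormat pattern: four-digit year, month, day of month
    path_time_format = "yyyy.MM.dd"
    serializer = "json"
}
```
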
### save_mode [string]
Storage mode. Supported values are `overwrite`, `append`, `ignore`, and `error`. For the meaning of each mode, see [save-modes](https://spark.apache.org/docs/latest/sql-programming-guide.html#save-modes).
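
For illustration, a sketch that appends to existing output instead of failing when the target already contains data (the default `error` mode raises an error in that case). The path is a hypothetical placeholder:

```bash
file {
    # Hypothetical output directory
    path = "hdfs:///tmp/seatunnel/output"
    serializer = "json"
    # Add new data alongside whatever is already at the path
    save_mode = "append"
}
```
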
### serializer [string]
Serialization method. Supported values are `csv`, `json`, `parquet`, `orc`, and `text`.
### common options [string]
Common sink plugin parameters. See [Sink Plugin](common-options.md) for details.
</TabItem>
<TabItem value="flink">
| name | type | required | default value |
|-------------------|--------| -------- |----------------|
| format | string | yes | - |
| path | string | yes | - |
| path_time_format | string | no | yyyyMMddHHmmss |
| write_mode | string | no | - |
| common-options | string | no | - |
| parallelism | int | no | - |
| rollover_interval | long | no | 1 |
| max_part_size | long | no | 1024 |
| prefix | string | no | seatunnel |
| suffix | string | no | .ext |
### format [string]
Supported formats are `csv`, `json`, and `text`. Streaming mode currently supports only `text`.
### path [string]
The file path is required. An HDFS path starts with `hdfs://`, and a local path starts with `file://`.
The path may contain the variables `${now}` and `${uuid}`, for example `hdfs:///test_${uuid}_${now}.txt`.
`${now}` represents the current time; its format is set with the `path_time_format` option.
### path_time_format [string]
When the `path` parameter contains `${now}` (for example `xxxx-${now}`), `path_time_format` specifies the time format used in the path. The default value is `yyyyMMddHHmmss`. Commonly used format symbols are listed below:
| Symbol | Description |
| ------ | ------------------ |
| y | Year |
| M | Month |
| d | Day of month |
| H | Hour in day (0-23) |
| m | Minute in hour |
| s | Second in minute |
See [Java SimpleDateFormat](https://docs.oracle.com/javase/tutorial/i18n/format/simpleDateFormat.html) for detailed time format syntax.
### write_mode [string]
- NO_OVERWRITE
    - Do not overwrite; an error is reported if the path already exists
- OVERWRITE
    - Overwrite; if the path exists, delete it and then write
### common options [string]
Common sink plugin parameters. See [Sink Plugin](common-options.md) for details.
### parallelism [int]
The parallelism of an individual operator, in this case the FileSink.
### rollover_interval [long]
The rollover interval for starting a new file part, in minutes.
### max_part_size [long]
The maximum size of each file part, in MB.
### prefix [string]
The prefix of each file part.
### suffix [string]
The suffix of each file part.
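
To show how the file part options fit together, here is a hedged sketch. The path and every value are illustrative assumptions, not recommended settings:

```bash
FileSink {
    # Streaming mode currently supports only `text`
    format = "text"
    path = "hdfs://localhost:9000/flink/output/"
    # Rollover interval for a new file part, in minutes
    rollover_interval = 5
    # Maximum size of each file part, in MB
    max_part_size = 256
    # Hypothetical prefix and suffix for each file part
    prefix = "events"
    suffix = ".txt"
    # Parallelism of the FileSink operator
    parallelism = 2
}
```
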
</TabItem>
</Tabs>
## Example
<Tabs
groupId="engine-type"
defaultValue="spark"
values={[
{label: 'Spark', value: 'spark'},
{label: 'Flink', value: 'flink'},
]}>
<TabItem value="spark">
```bash
file {
path = "file:///var/logs"
serializer = "text"
}
```
</TabItem>
<TabItem value="flink">
```bash
FileSink {
format = "json"
path = "hdfs://localhost:9000/flink/output/"
write_mode = "OVERWRITE"
}
```
</TabItem>
</Tabs>