# HdfsFile source connector

Read data from the HDFS file system.

The connector reads all the data of a split in a single `pollNext` call. The splits that have been read are saved in the snapshot.
## Options

name | type | required | default value |
---|---|---|---|
path | string | yes | - |
type | string | yes | - |
fs.defaultFS | string | yes | - |
delimiter | string | no | \001 |
parse_partition_from_path | boolean | no | true |
date_format | string | no | yyyy-MM-dd |
datetime_format | string | no | yyyy-MM-dd HH:mm:ss |
time_format | string | no | HH:mm:ss |
schema | config | no | - |
common-options | | no | - |
### path [string]

The source file path.
### delimiter [string]

Field delimiter, used to tell the connector how to slice fields when reading text files.

Default `\001`, the same as Hive's default delimiter.
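For a quick illustration, a text source reading comma-separated files could override the default like this; the path and cluster address below are placeholders, not values from this guide:

```hocon
HdfsFile {
  path = "/tmp/seatunnel/text"          # placeholder path
  type = "text"
  fs.defaultFS = "hdfs://namenode001"   # placeholder cluster address
  delimiter = ","                       # override the default \001
}
```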
### parse_partition_from_path [boolean]

Control whether to parse the partition keys and values from the file path.

For example, if you read a file from the path `hdfs://hadoop-cluster/tmp/seatunnel/parquet/name=tyrantlucifer/age=26`, every record read from the file will have these two fields added:

name | age |
---|---|
tyrantlucifer | 26 |

Tips: Do not define partition fields in the `schema` option.
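A minimal configuration sketch for the example above; the path and cluster address mirror the sample path, everything else is an assumption:

```hocon
HdfsFile {
  # the name=... and age=... segments are parsed from the path and
  # appended to every record; do not declare them in schema
  path = "/tmp/seatunnel/parquet/name=tyrantlucifer/age=26"
  type = "parquet"
  fs.defaultFS = "hdfs://hadoop-cluster"
  parse_partition_from_path = true
}
```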
### date_format [string]

Date type format, used to tell the connector how to convert a string to a date. The following formats are supported:

`yyyy-MM-dd` `yyyy.MM.dd` `yyyy/MM/dd`

Default `yyyy-MM-dd`.
### datetime_format [string]

Datetime type format, used to tell the connector how to convert a string to a datetime. The following formats are supported:

`yyyy-MM-dd HH:mm:ss` `yyyy.MM.dd HH:mm:ss` `yyyy/MM/dd HH:mm:ss` `yyyyMMddHHmmss`

Default `yyyy-MM-dd HH:mm:ss`.
### time_format [string]

Time type format, used to tell the connector how to convert a string to a time. The following formats are supported:

`HH:mm:ss` `HH:mm:ss.SSS`

Default `HH:mm:ss`.
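If your text files carry temporal fields in non-default formats, the three options above can be declared alongside the schema. A sketch with hypothetical field names and placeholder paths:

```hocon
HdfsFile {
  path = "/tmp/seatunnel/text"            # placeholder path
  type = "text"
  fs.defaultFS = "hdfs://namenode001"     # placeholder cluster address
  delimiter = "#"
  date_format = "yyyy/MM/dd"              # e.g. 2023/12/31
  datetime_format = "yyyy/MM/dd HH:mm:ss"
  time_format = "HH:mm:ss.SSS"
  schema {
    fields {
      # hypothetical fields, for illustration only
      birthday = date
      created_at = timestamp
      alarm_time = time
    }
  }
}
```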
### type [string]

File type. The following file types are supported:

`text` `csv` `parquet` `orc` `json`

If you assign the file type to `json`, you should also assign the `schema` option to tell the connector how to parse the data into the rows you want.
For example, if the upstream data is the following:

```json
{"code": 200, "data": "get success", "success": true}
```

You can also save multiple pieces of data in one file, split by newline:

```json
{"code": 200, "data": "get success", "success": true}
{"code": 300, "data": "get failed", "success": false}
```

you should assign the schema as follows:

```hocon
schema {
  fields {
    code = int
    data = string
    success = boolean
  }
}
```

The connector will generate data as follows:

code | data | success |
---|---|---|
200 | get success | true |
If you assign the file type to `parquet` or `orc`, the `schema` option is not required; the connector can find the schema of the upstream data automatically.
If you assign the file type to `text` or `csv`, you can choose whether to specify the schema information.

For example, if the upstream data is the following:

```text
tyrantlucifer#26#male
```

If you do not assign the data schema, the connector will treat the upstream data as follows:

content |
---|
tyrantlucifer#26#male |

If you assign a data schema, you should also assign the `delimiter` option (except for the `csv` file type). For example, you should assign the schema and delimiter as follows:

```hocon
delimiter = "#"

schema {
  fields {
    name = string
    age = int
    gender = string
  }
}
```

The connector will generate data as follows:

name | age | gender |
---|---|---|
tyrantlucifer | 26 | male |
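Putting the pieces together, a sketch of a complete text source for the example above; the path and cluster address are placeholders:

```hocon
HdfsFile {
  path = "/tmp/seatunnel/text"          # placeholder path
  type = "text"
  fs.defaultFS = "hdfs://namenode001"   # placeholder cluster address
  delimiter = "#"
  schema {
    fields {
      name = string
      age = int
      gender = string
    }
  }
}
```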
### fs.defaultFS [string]

The HDFS cluster address.
### schema [config]

The schema fields of the upstream data.
### common options

Source plugin common parameters, please refer to Source Common Options for details.
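For instance, SeaTunnel sources commonly accept `result_table_name` to register their output for downstream plugins; a minimal sketch, assuming that common option:

```hocon
HdfsFile {
  path = "/apps/hive/demo/student"
  type = "parquet"
  fs.defaultFS = "hdfs://namenode001"
  # common option: expose this source's output as a temporary
  # table that downstream transforms and sinks can reference
  result_table_name = "student"
}
```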
## Example

```hocon
HdfsFile {
  path = "/apps/hive/demo/student"
  type = "parquet"
  fs.defaultFS = "hdfs://namenode001"
}
```

```hocon
HdfsFile {
  schema {
    fields {
      name = string
      age = int
    }
  }
  path = "/apps/hive/demo/student"
  type = "json"
  fs.defaultFS = "hdfs://namenode001"
}
```
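For context, a sketch of how one of the blocks above might sit inside a full SeaTunnel job config; the env settings and the Console sink are assumptions, not part of this connector:

```hocon
env {
  # assumed engine settings for a simple batch read
  execution.parallelism = 1
  job.mode = "BATCH"
}

source {
  HdfsFile {
    path = "/apps/hive/demo/student"
    type = "parquet"
    fs.defaultFS = "hdfs://namenode001"
  }
}

sink {
  # assumed sink: print rows to stdout for a quick smoke test
  Console {}
}
```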