HDFS uses the general capabilities of Flink's FileSystem abstraction to support reading and writing both single files and partitioned files. The file system connector is included in Flink itself and does not require an additional dependency; the corresponding JAR can be found in the /lib directory of the Flink distribution. A corresponding format must be specified for reading rows from and writing rows to a file system.
The example below shows how to create an HDFS Load Node with the Flink SQL CLI:
```sql
CREATE TABLE hdfs_load_node (
  id STRING,
  name STRING,
  uv BIGINT,
  pv BIGINT,
  dt STRING,
  `hour` STRING
) PARTITIONED BY (dt, `hour`) WITH (
  'connector' = 'filesystem-inlong',
  'path' = '...',
  'format' = 'orc',
  'sink.partition-commit.delay' = '1 h',
  'sink.partition-commit.policy.kind' = 'success-file'
);
```
Data within the partition directories is split into part files. Each partition contains at least one part file for every subtask of the sink that has received data for that partition. According to the configurable rolling policy, the in-progress part file is closed and an additional part file is created. The policy rolls part files based on file size and on a timeout that specifies the maximum duration a file can remain open.
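As an illustration, the rolling policy can be tuned through table options in the WITH clause; the values below are examples only, not recommended defaults:

```sql
CREATE TABLE hdfs_load_node_rolling (
  id STRING,
  name STRING
) WITH (
  'connector' = 'filesystem-inlong',
  'path' = '...',
  'format' = 'orc',
  -- roll a part file once it exceeds this size
  'sink.rolling-policy.file-size' = '128MB',
  -- roll a part file once it has been open this long
  'sink.rolling-policy.rollover-interval' = '30 min',
  -- how often the policy checks open files against the rules above
  'sink.rolling-policy.check-interval' = '1 min'
);
```

Note that for bulk formats such as ORC or Parquet, part files are additionally rolled on every checkpoint regardless of these settings.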
The file sink supports file compactions, which allows applications to have smaller checkpoint intervals without generating a large number of files.
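Compaction is enabled through table options as well. A sketch of such a configuration, with illustrative values:

```sql
CREATE TABLE hdfs_load_node_compact (
  id STRING,
  name STRING
) WITH (
  'connector' = 'filesystem-inlong',
  'path' = '...',
  'format' = 'orc',
  -- merge small files produced within a checkpoint before committing them
  'auto-compaction' = 'true',
  -- target size of compacted files; defaults to the rolling file size if unset
  'compaction.file-size' = '128MB'
);
```

With auto-compaction enabled, files are first written to temporary files and only made visible after compaction, so data appears downstream with checkpoint-interval latency.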
After writing a partition, it is often necessary to notify downstream applications, for example by adding the partition to a Hive metastore or by writing a _SUCCESS file in the partition directory. The file system sink contains a partition commit feature that allows configuring custom policies. Commit actions are based on a combination of triggers and policies.
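A sketch of how triggers and policies can be combined, assuming event-time partitioning on `dt` and `hour` (the extractor pattern and delay shown are illustrative):

```sql
CREATE TABLE hdfs_load_node_commit (
  id STRING,
  name STRING,
  dt STRING,
  `hour` STRING
) PARTITIONED BY (dt, `hour`) WITH (
  'connector' = 'filesystem-inlong',
  'path' = '...',
  'format' = 'orc',
  -- trigger: commit when the watermark passes partition time plus the delay
  'sink.partition-commit.trigger' = 'partition-time',
  'sink.partition-commit.delay' = '1 h',
  -- how partition time is derived from partition values
  'partition.time-extractor.timestamp-pattern' = '$dt $hour:00:00',
  -- policies: register the partition in the metastore and write a _SUCCESS file
  'sink.partition-commit.policy.kind' = 'metastore,success-file'
);
```

The alternative trigger value 'process-time' commits based on the machine clock instead of watermarks, which is simpler but may commit partitions before all their data has arrived.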
The partition commit policy defines the specific action that is taken when a partition is committed.