HDFS uses the general capabilities of Flink's FileSystem abstraction to support reading and writing both single files and partitioned files. The file system connector is included in Flink itself and does not require an additional dependency; the corresponding JAR can be found in the /lib directory of the Flink distribution. A corresponding format must be specified for reading rows from and writing rows to a file system.
The example below shows how to create an HDFS Load Node with the Flink SQL CLI:
```sql
CREATE TABLE hdfs_load_node (
  id STRING,
  name STRING,
  uv BIGINT,
  pv BIGINT,
  dt STRING,
  `hour` STRING
) PARTITIONED BY (dt, `hour`) WITH (
  'connector' = 'filesystem-inlong',
  'path' = '...',
  'format' = 'orc',
  'sink.partition-commit.delay' = '1 h',
  'sink.partition-commit.policy.kind' = 'success-file'
);
```
Data within the partition directories is split into part files. Each partition contains at least one part file for every subtask of the sink that has received data for that partition. According to the configurable rolling policy, the in-progress part file is closed and an additional part file is created. The policy rolls part files based on file size and on a timeout that specifies the maximum duration a file can remain open.
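As an illustration, the rolling policy can be tuned through table options in the WITH clause; the values below are examples only, not recommended defaults:

```sql
CREATE TABLE hdfs_load_node_rolling (
  id STRING,
  name STRING
) WITH (
  'connector' = 'filesystem-inlong',
  'path' = '...',
  'format' = 'orc',
  -- roll a part file once it exceeds this size
  'sink.rolling-policy.file-size' = '128MB',
  -- roll a part file once it has been open this long
  'sink.rolling-policy.rollover-interval' = '30 min',
  -- how often the policy checks open files against the rules above
  'sink.rolling-policy.check-interval' = '1 min'
);
```

Note that for bulk formats such as ORC or Parquet, part files are additionally rolled on every checkpoint regardless of these settings.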
The file sink supports file compactions, which allows applications to have smaller checkpoint intervals without generating a large number of files.
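Compaction is enabled through table options as well. A sketch of such a configuration, with illustrative values:

```sql
CREATE TABLE hdfs_load_node_compact (
  id STRING,
  name STRING
) WITH (
  'connector' = 'filesystem-inlong',
  'path' = '...',
  'format' = 'orc',
  -- merge small files produced within a checkpoint before committing them
  'auto-compaction' = 'true',
  -- target size of compacted files; defaults to the rolling file size if unset
  'compaction.file-size' = '128MB'
);
```

With auto-compaction enabled, files are first written to temporary files and only made visible after compaction, so data appears downstream with checkpoint-interval latency.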
After writing a partition, it is often necessary to notify downstream applications, for example by adding the partition to a Hive metastore or by writing a _SUCCESS file in the partition directory. The file system sink contains a partition commit feature that allows configuring custom policies. Commit actions are based on a combination of triggers and policies.
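A sketch of how triggers and policies can be combined, assuming event-time partitioning on `dt` and `hour` (the extractor pattern and delay shown are illustrative):

```sql
CREATE TABLE hdfs_load_node_commit (
  id STRING,
  name STRING,
  dt STRING,
  `hour` STRING
) PARTITIONED BY (dt, `hour`) WITH (
  'connector' = 'filesystem-inlong',
  'path' = '...',
  'format' = 'orc',
  -- trigger: commit when the watermark passes partition time plus the delay
  'sink.partition-commit.trigger' = 'partition-time',
  'sink.partition-commit.delay' = '1 h',
  -- how partition time is derived from partition values
  'partition.time-extractor.timestamp-pattern' = '$dt $hour:00:00',
  -- policies: register the partition in the metastore and write a _SUCCESS file
  'sink.partition-commit.policy.kind' = 'metastore,success-file'
);
```

The alternative trigger value 'process-time' commits based on the machine clock instead of watermarks, which is simpler but may commit partitions before all their data has arrived.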
The partition commit policy defines the specific action that is taken when a partition is committed.