
HDFS IO

This library provides HDFS sources and sinks to make it possible to read and write Apache Hadoop file formats from Apache Beam pipelines.

Currently, only the read path is implemented. An HDFSFileSource allows any Hadoop FileInputFormat to be read as a PCollection.

An HDFSFileSource can be read using the org.apache.beam.sdk.io.Read transform. For example:

// "pipeline" is assumed to be an existing org.apache.beam.sdk.Pipeline.
HDFSFileSource<MyKey, MyValue> source = HDFSFileSource.from(path, MyInputFormat.class,
  MyKey.class, MyValue.class);
PCollection<KV<MyKey, MyValue>> records = pipeline.apply(Read.from(source));

Alternatively, readFrom is a convenience method that returns a read transform directly. For example:

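// readFrom returns a transform, so it must be applied to a pipeline
// ("pipeline" is assumed to be an existing org.apache.beam.sdk.Pipeline).
PCollection<KV<MyKey, MyValue>> records = pipeline.apply(HDFSFileSource.readFrom(path,
  MyInputFormat.class, MyKey.class, MyValue.class));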