# Description

An extension to FsDataWriter that writes data in Parquet format, accepting records in one of three in-memory representations: Avro, Protobuf, or ParquetGroup. This implementation allows users to specify the compression codec (the CodecFactory to use) through the configuration property writer.codec.type; by default, the snappy codec is used. See the Developer Notes below to make sure you are using the right Gobblin jar.
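For example, a job that prefers gzip over the default snappy codec (assuming gzip is among the codecs supported by the CodecFactory in your deployment) would add:

```properties
# Illustrative codec override; see the Configuration table below for the
# Parquet-specific tuning properties.
writer.codec.type=gzip
```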

# Usage

```properties
writer.builder.class=org.apache.gobblin.writer.ParquetDataWriterBuilder
writer.destination.type=HDFS
writer.output.format=PARQUET
```

# Example Pipeline Configuration
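A minimal, illustrative job configuration is sketched below. The job name, source class, and directories are hypothetical placeholders rather than values required by Gobblin; any source and converter chain that emits records in one of the supported in-memory formats will work.

```properties
# Hypothetical .pull file -- names, paths, and the source class are placeholders.
job.name=ExampleParquetIngestion
job.group=examples

# Replace with a source (and converters) that produce records in one of the
# supported in-memory formats: AVRO, PROTOBUF, or GROUP.
source.class=com.example.source.MyRecordSource

writer.builder.class=org.apache.gobblin.writer.ParquetDataWriterBuilder
writer.destination.type=HDFS
writer.output.format=PARQUET
writer.parquet.format=GROUP
writer.codec.type=snappy

writer.staging.dir=/tmp/gobblin/task-staging
writer.output.dir=/tmp/gobblin/task-output
data.publisher.type=org.apache.gobblin.publisher.BaseDataPublisher
data.publisher.final.dir=/data/parquet-out
```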

# Configuration

| Key | Description | Default Value | Required |
|-----|-------------|---------------|----------|
| writer.parquet.page.size | The page size threshold, in bytes. | 1048576 | No |
| writer.parquet.dictionary.page.size | The block size threshold for the dictionary pages, in bytes. | 134217728 | No |
| writer.parquet.dictionary | Turns dictionary encoding on. Parquet has a dictionary encoding for data with a small number of unique values (< 10^5) that yields significant compression and boosts processing speed. | true | No |
| writer.parquet.validate | Turns on validation against the schema. This validation is done by ParquetWriter, not by Gobblin. | false | No |
| writer.parquet.version | Version of the Parquet writer to use. Available versions are v1 and v2. | v1 | No |
| writer.parquet.format | In-memory format of the record being written to Parquet. Options are AVRO, PROTOBUF and GROUP. | GROUP | No |
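For instance, a job that wants the v2 writer, larger pages, and schema validation might override the defaults like this (values purely illustrative):

```properties
# Illustrative tuning of the Parquet writer; all keys are documented above.
writer.parquet.page.size=2097152
writer.parquet.dictionary=true
writer.parquet.validate=true
writer.parquet.version=v2
writer.parquet.format=AVRO
```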

# Developer Notes

Gobblin provides integration with two different versions of Parquet through its modules. Use the appropriate jar based on the Parquet library you use in your code.

| Jar | Dependency | Gobblin Release |
|-----|------------|-----------------|
| gobblin-parquet | com.twitter:parquet-hadoop-bundle | >= 0.12.0 |
| gobblin-parquet-apache | org.apache.parquet:parquet-hadoop | >= 0.15.0 |
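The package names in your own code are a quick way to tell which module you need; for example (illustrative imports only):

```java
// If your code compiles against org.apache.parquet classes, depend on
// gobblin-parquet-apache:
import org.apache.parquet.hadoop.ParquetWriter;
import org.apache.parquet.example.data.Group;

// If it still uses the older Twitter bundle (classes under the bare
// "parquet" package), depend on gobblin-parquet instead:
// import parquet.hadoop.ParquetWriter;
// import parquet.example.data.Group;
```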

If you want to look at the code, check out:

| Module | File |
|--------|------|
| gobblin-parquet | ParquetHdfsDataWriter |
| gobblin-parquet | ParquetDataWriterBuilder |
| gobblin-parquet-apache | ParquetHdfsDataWriter |
| gobblin-parquet-apache | ParquetDataWriterBuilder |
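For context, ParquetHdfsDataWriter hands each record to a Parquet writer under the hood. Below is a rough, self-contained sketch of what writing GROUP-format records looks like with the org.apache.parquet API used by gobblin-parquet-apache; it is a simplified illustration, not Gobblin's actual implementation, and the schema and output path are made up.

```java
import org.apache.hadoop.fs.Path;
import org.apache.parquet.example.data.Group;
import org.apache.parquet.example.data.simple.SimpleGroupFactory;
import org.apache.parquet.hadoop.ParquetWriter;
import org.apache.parquet.hadoop.example.ExampleParquetWriter;
import org.apache.parquet.hadoop.metadata.CompressionCodecName;
import org.apache.parquet.schema.MessageType;
import org.apache.parquet.schema.MessageTypeParser;

public class GroupWriteSketch {
  public static void main(String[] args) throws Exception {
    // Made-up schema and output path, purely for illustration.
    MessageType schema = MessageTypeParser.parseMessageType(
        "message page { required binary title (UTF8); required int64 views; }");

    try (ParquetWriter<Group> writer = ExampleParquetWriter
        .builder(new Path("/tmp/example.parquet"))
        .withType(schema)
        .withCompressionCodec(CompressionCodecName.SNAPPY)
        .build()) {
      // Each GROUP record is a Parquet Group assembled against the schema.
      Group record = new SimpleGroupFactory(schema).newGroup()
          .append("title", "Apache Gobblin")
          .append("views", 42L);
      writer.write(record);
    }
  }
}
```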