commit 6461927eac3f8a225b17af5ecb6ace8c9cf1757b
tree 97b4bf07cbc977d143f82a7dd567a17525e0421d
parent 6df8f88d86a20bbabeb0acc1103376ac6f461df6
author Prashant Wason <pwason@uber.com> Mon Aug 31 08:05:59 2020 -0700
committer GitHub <noreply@github.com> Mon Aug 31 08:05:59 2020 -0700
[HUDI-960] Implementation of the HFile base and log file format. (#1804)

1. Includes HFileWriter and HFileReader
2. Includes HFileInputFormat for both snapshot and realtime input formats for Hive
3. Unit tests for the new code
4. IT for using the HFile format and querying using Hive (Presto and SparkSQL are not supported)

Advantage: The HFile format stores data as binary key-value pairs. This implementation chooses the following values:

1. Key = Hoodie Record Key (as bytes)
2. Value = Avro-encoded GenericRecord (as bytes)

HFile allows efficient lookup of a record by key or by a range of keys. Hence, this base file format is well suited to applications like RFC-15 and RFC-08, which benefit from the ability to look up records by key or search within a range of keys without having to read the entire data/log file.

Limitations: The HFile storage format has certain limitations when used as a general-purpose data storage format:

1. Does not have an implemented reader for Presto and SparkSQL
2. Is not a columnar file format, and hence may lead to lower compression levels and greater IO on the query side due to the lack of column pruning

Other changes:
- Remove databricks/avro from pom
- Fix HoodieClientTestUtils to not use scala imports/reflection-based conversion, etc.
- Break up limitFileSize() per parquet and hfile base files
- Add three new configs for HoodieHFileConfig: prefetchBlocksOnOpen, cacheDataInL1, dropBehindCacheCompaction
- Throw UnsupportedException in HFileReader.getRecordKeys()
- Update HoodieCopyOnWriteTable to create the correct merge handle (HoodieSortedMergeHandle for HFile and HoodieMergeHandle otherwise)

* Fixing checkstyle

Co-authored-by: Vinoth Chandar <vinoth@apache.org>
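To make the new format concrete: once Hudi is built (see below), a table's base file format can be selected when the table is created. The following is a minimal, hedged Scala sketch run from a spark-shell with the Hudi bundle; the table name, path, and columns are illustrative, and it assumes the `hoodie.table.base.file.format` option is honored at table creation. Per the limitations above, such a table would be queryable through Hive but not Presto or SparkSQL.

```scala
import org.apache.spark.sql.SaveMode
import spark.implicits._  // `spark` is the SparkSession provided by spark-shell

// A tiny illustrative DataFrame; the record key column becomes the HFile key,
// and each row is stored as an Avro-encoded GenericRecord value.
val df = Seq(("key1", "2020-08-31T00:00:00Z", "p0", "v1")).
  toDF("uuid", "ts", "partitionpath", "value")

df.write.format("hudi").
  option("hoodie.table.name", "hfile_demo").                 // illustrative table name
  option("hoodie.table.base.file.format", "HFILE").          // base files as HFile instead of Parquet
  option("hoodie.datasource.write.recordkey.field", "uuid").
  option("hoodie.datasource.write.precombine.field", "ts").
  option("hoodie.datasource.write.partitionpath.field", "partitionpath").
  mode(SaveMode.Overwrite).
  save("file:///tmp/hfile_demo")                             // illustrative base path
```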
Apache Hudi (pronounced Hoodie) stands for Hadoop Upserts Deletes and Incrementals. Hudi manages the storage of large analytical datasets on DFS (cloud stores, HDFS, or any Hadoop FileSystem compatible storage).
Hudi supports three types of queries:

* **Snapshot Query** - Provides snapshot queries on real-time data, using a combination of columnar and row-based storage (e.g. Parquet + Avro).
* **Incremental Query** - Provides a change stream with records inserted or updated after a point in time.
* **Read Optimized Query** - Provides excellent snapshot query performance via purely columnar storage (e.g. Parquet).
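For example, an incremental query can be issued through the Spark datasource. A hedged sketch, assuming a spark-shell started as described below and using the 0.x-era option keys; the base path and begin instant are placeholders:

```scala
// Incremental query: fetch only records written after the given commit instant.
val incrementalDF = spark.read.format("hudi").
  option("hoodie.datasource.query.type", "incremental").
  option("hoodie.datasource.read.begin.instanttime", "20200831000000"). // placeholder commit time
  load("/tmp/hudi_demo") // placeholder base path of an existing Hudi table
```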
Learn more about Hudi at https://hudi.apache.org
Prerequisites for building Apache Hudi:

* Unix-like system (like Linux, Mac OS X)
* Java 8 (Java 9 or 10 may work)
* Git
* Maven
```bash
# Checkout code and build
git clone https://github.com/apache/hudi.git && cd hudi
mvn clean package -DskipTests

# Start command
spark-2.4.4-bin-hadoop2.7/bin/spark-shell \
  --jars `ls packaging/hudi-spark-bundle/target/hudi-spark-bundle_2.11-*.*.*-SNAPSHOT.jar` \
  --conf 'spark.serializer=org.apache.spark.serializer.KryoSerializer'
```
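With the shell up, a quick sanity check is a snapshot read of an existing (Parquet-based) Hudi table; a sketch with a placeholder path:

```scala
// Snapshot query: read the latest committed view of a Hudi table.
// The trailing glob covers the partition directories under the base path.
val snapshotDF = spark.read.format("hudi").load("/tmp/hudi_demo/*")
snapshotDF.printSchema()
```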
To build the Javadoc for all Java and Scala classes:
```bash
# Javadoc generated under target/site/apidocs
mvn clean javadoc:aggregate -Pjavadocs
```
The default Scala version supported is 2.11. To build for Scala 2.12, use the scala-2.12 profile:

```bash
mvn clean package -DskipTests -Dscala-2.12
```
By default, the Hudi jar bundles the spark-avro module. To build without the spark-avro module, use the spark-shade-unbundle-avro profile:

```bash
# Checkout code and build
git clone https://github.com/apache/hudi.git && cd hudi
mvn clean package -DskipTests -Pspark-shade-unbundle-avro

# Start command
spark-2.4.4-bin-hadoop2.7/bin/spark-shell \
  --packages org.apache.spark:spark-avro_2.11:2.4.4 \
  --jars `ls packaging/hudi-spark-bundle/target/hudi-spark-bundle_2.11-*.*.*-SNAPSHOT.jar` \
  --conf 'spark.serializer=org.apache.spark.serializer.KryoSerializer'
```
Unit tests can be run with the unit-tests maven profile:

```bash
mvn -Punit-tests test
```
Functional tests, which are tagged with @Tag("functional"), can be run with the functional-tests maven profile:

```bash
mvn -Pfunctional-tests test
```
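For illustration, a hypothetical test class showing how the @Tag annotation routes a test into the functional suite (Hudi's own tests are largely Java; this sketch uses Scala with JUnit 5 for consistency with the examples above):

```scala
import org.junit.jupiter.api.{Assertions, Tag, Test}

// Hypothetical example: the @Tag("functional") annotation is what places this
// class in the functional-tests profile rather than the unit-tests profile.
@Tag("functional")
class ExampleFunctionalTest {

  @Test
  def roundTripsARecord(): Unit = {
    // ... exercise behavior that needs the heavier functional setup ...
    Assertions.assertEquals(2, 1 + 1)
  }
}
```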
To run tests with Spark event logging enabled, define the Spark event log directory. This allows visualizing the test DAG and stages using the Spark History Server UI:

```bash
mvn -Punit-tests test -DSPARK_EVLOG_DIR=/path/for/spark/event/log
```
Please visit https://hudi.apache.org/docs/quick-start-guide.html to quickly explore Hudi's capabilities using spark-shell.