Apache Hudi (pronounced Hoodie) stands for Hadoop Upserts Deletes and Incrementals. Hudi manages the storage of large analytical datasets on DFS (cloud stores, HDFS, or any Hadoop FileSystem compatible storage).
Hudi supports three types of queries:

* Snapshot Query - queries the latest snapshot of the table, merging columnar base files with any newer row-based log files.
* Incremental Query - queries only the records inserted or updated after a given commit/instant time, exposing a change stream.
* Read Optimized Query - queries only the columnar base files of a Merge-on-Read table, trading data freshness for query performance.
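As an illustration only, the sketch below shows how these query types are selected through the Spark datasource; the table path and begin instant time are hypothetical, and option names may differ slightly across Hudi versions:

```scala
// Illustrative only: reading an existing Hudi table from spark-shell.
// "basePath" is a hypothetical location; replace it with a real table path.
val basePath = "file:///tmp/hudi_table"

// Snapshot query (the default): latest view of the table.
val snapshotDF = spark.read.format("hudi").load(basePath)

// Incremental query: only records committed after the given instant time.
val incrementalDF = spark.read.format("hudi").
  option("hoodie.datasource.query.type", "incremental").
  option("hoodie.datasource.read.begin.instanttime", "20210425000000"). // hypothetical instant
  load(basePath)

// Read optimized query (Merge-on-Read tables): columnar base files only.
val readOptimizedDF = spark.read.format("hudi").
  option("hoodie.datasource.query.type", "read_optimized").
  load(basePath)
```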
Learn more about Hudi at https://hudi.apache.org
Prerequisites for building Apache Hudi:

* Unix-like system (like Linux, Mac OS X)
* Java 8 (Java 9 or 10 may work)
* Git
* Maven
# Checkout code and build
git clone https://github.com/apache/hudi.git && cd hudi
mvn clean package -DskipTests

# Start command
spark-2.4.4-bin-hadoop2.7/bin/spark-shell \
  --jars `ls packaging/hudi-spark-bundle/target/hudi-spark-bundle_2.11-*.*.*-SNAPSHOT.jar` \
  --conf 'spark.serializer=org.apache.spark.serializer.KryoSerializer'
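Once the shell is up, a minimal, illustrative write could look like the sketch below. The table name, storage path, and column names are made up for the example; the option keys are standard Hudi write options:

```scala
// Illustrative only: writing a toy DataFrame as a Hudi table from spark-shell.
import org.apache.spark.sql.SaveMode

val tableName = "hudi_trips"              // hypothetical table name
val basePath  = "file:///tmp/hudi_trips"  // hypothetical storage location

val df = Seq(
  ("id-1", "rider-A", "2021-04-25", 27.70),
  ("id-2", "rider-B", "2021-04-25", 33.90)
).toDF("uuid", "rider", "ts", "fare")

df.write.format("hudi").
  option("hoodie.datasource.write.recordkey.field", "uuid").       // record key
  option("hoodie.datasource.write.partitionpath.field", "rider").  // partition column
  option("hoodie.datasource.write.precombine.field", "ts").        // de-duplication/ordering field
  option("hoodie.table.name", tableName).
  mode(SaveMode.Overwrite).
  save(basePath)
```

See https://hudi.apache.org/docs/quick-start-guide.html for the complete, version-specific walkthrough.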
To build the Javadoc for all Java and Scala classes:
# Javadoc generated under target/site/apidocs
mvn clean javadoc:aggregate -Pjavadocs
The default Scala version supported is 2.11. To build for Scala 2.12, use the scala-2.12 profile:
mvn clean package -DskipTests -Dscala-2.12
The default Spark version supported is 2.4.4. To build for Spark 3.0.0, use the spark3 profile:
mvn clean package -DskipTests -Dspark3
The default Hudi jar bundles the spark-avro module. To build without the spark-avro module, use the spark-shade-unbundle-avro profile:
# Checkout code and build
git clone https://github.com/apache/hudi.git && cd hudi
mvn clean package -DskipTests -Pspark-shade-unbundle-avro

# Start command
spark-2.4.4-bin-hadoop2.7/bin/spark-shell \
  --packages org.apache.spark:spark-avro_2.11:2.4.4 \
  --jars `ls packaging/hudi-spark-bundle/target/hudi-spark-bundle_2.11-*.*.*-SNAPSHOT.jar` \
  --conf 'spark.serializer=org.apache.spark.serializer.KryoSerializer'
Unit tests can be run with the Maven profile unit-tests:
mvn -Punit-tests test
Functional tests, which are tagged with @Tag("functional"), can be run with the Maven profile functional-tests:
mvn -Pfunctional-tests test
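For orientation only, a functional test could be tagged as in the sketch below; the class name and assertion are placeholders, not code from the Hudi repository:

```scala
// Illustrative only: how a test picked up by the functional-tests profile is tagged.
import org.junit.jupiter.api.{Tag, Test}
import org.junit.jupiter.api.Assertions.assertTrue

@Tag("functional")
class ExampleFunctionalTest {   // hypothetical test class

  @Test
  def exampleScenario(): Unit = {
    // A real functional test would start a Spark session and exercise a Hudi table end to end;
    // this placeholder only shows where the @Tag("functional") annotation goes.
    assertTrue(true)
  }
}
```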
To run tests with Spark event logging enabled, define the Spark event log directory. This allows visualizing the test DAG and stages using the Spark History Server UI.
mvn -Punit-tests test -DSPARK_EVLOG_DIR=/path/for/spark/event/log
Please visit https://hudi.apache.org/docs/quick-start-guide.html to quickly explore Hudi's capabilities using spark-shell.