commit 788d236c443eb4ced819f9305ed8e0460b5984b7
author: Prashant Wason <pwason@uber.com> | Wed Oct 07 05:33:51 2020 -0700
committer: GitHub <noreply@github.com> | Wed Oct 07 08:33:51 2020 -0400
tree: cb2acc040054a19fe8573cd67ed8015b2740a361
parent: 524193eb4bd45eb21a64c68e927600a52b9ade1d
[HUDI-1303] Some improvements for the HUDI Test Suite. (#2128)

1. Use the DAG Node's label from the yaml as its name, instead of UUID names which are not descriptive when debugging issues from logs.
2. Fix the CleanNode constructor, which was not correctly implemented.
3. When generating upserts, allow more granular control over the number of inserts and upserts: zero or more of each can be specified, instead of always requiring both inserts and upserts.
4. Fix generation of records of a specific size. The previous code used a class variable "shouldAddMore" which was reset to false after the first record was generated, causing subsequent records to be of minimum size. In this change, we pre-calculate the extra size of the complex fields; when generating records, the size for each complex field is read from this map (sketched below).
5. Refresh the timeline of the DeltaSync service before calling readFromSource. This ensures that only the newest generated data is read and that data generated in older DAG Nodes is ignored (as their AVRO files have an older timestamp).
6. Make --workload-generator-classname an optional parameter, as the default will most probably be used.
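As a rough illustration of the fix in point 4, here is a minimal sketch of the pre-calculated size map, assuming a generator that targets a fixed record size; all names below are hypothetical and do not come from the Hudi test-suite code.

```scala
// Hypothetical sketch (not Hudi's actual test-suite code): instead of a
// mutable shouldAddMore flag that flips to false after the first record,
// the extra bytes for each complex field are pre-computed once and then
// looked up for every generated record.
class RecordSizeEstimator(complexFields: Seq[String], targetSize: Int, minSize: Int) {

  // Computed once: the extra payload spread evenly across complex fields.
  private val extraBytesPerField: Map[String, Int] = {
    val extra = math.max(0, targetSize - minSize)
    val perField = if (complexFields.isEmpty) 0 else extra / complexFields.size
    complexFields.map(f => f -> perField).toMap
  }

  // Called for every record; later records see the same sizes as the first,
  // rather than collapsing to the minimum size.
  def extraBytesFor(field: String): Int = extraBytesPerField.getOrElse(field, 0)
}
```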
Apache Hudi (pronounced "Hoodie") stands for Hadoop Upserts Deletes and Incrementals. Hudi manages the storage of large analytical datasets on DFS (cloud stores, HDFS, or any Hadoop FileSystem compatible storage).
Hudi supports three types of queries:

* **Snapshot Query** - provides snapshot queries on real-time data, using a combination of columnar and row-based storage (e.g. Parquet + Avro).
* **Incremental Query** - provides a change stream with records inserted or updated after a point in time.
* **Read Optimized Query** - provides excellent snapshot query performance via purely columnar storage (e.g. Parquet).
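For illustration, the same table can be read with each query type from spark-shell (where `spark` is predefined). This is a hedged sketch using Hudi's standard read options; the table path and commit time are placeholders.

```scala
val basePath = "/tmp/hudi_trips" // placeholder table path

// Snapshot query (the default): the latest view of the table.
// Older Hudi versions require a glob over partition directories.
val snapshot = spark.read.format("hudi").
  option("hoodie.datasource.query.type", "snapshot").
  load(basePath + "/*")

// Incremental query: only records written after the given commit instant.
val incremental = spark.read.format("hudi").
  option("hoodie.datasource.query.type", "incremental").
  option("hoodie.datasource.read.begin.instanttime", "20201007000000").
  load(basePath)

// Read-optimized query: base (columnar) files only.
val readOptimized = spark.read.format("hudi").
  option("hoodie.datasource.query.type", "read_optimized").
  load(basePath + "/*")
```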
Learn more about Hudi at https://hudi.apache.org
Prerequisites for building Apache Hudi:

* Unix-like system (like Linux, Mac OS X)
* Java 8 (Java 9 or 10 may work)
* Git
* Maven
```
# Checkout code and build
git clone https://github.com/apache/hudi.git && cd hudi
mvn clean package -DskipTests

# Start command
spark-2.4.4-bin-hadoop2.7/bin/spark-shell \
  --jars `ls packaging/hudi-spark-bundle/target/hudi-spark-bundle_2.11-*.*.*-SNAPSHOT.jar` \
  --conf 'spark.serializer=org.apache.spark.serializer.KryoSerializer'
```
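Once the shell is up, a small write is a quick way to sanity-check the build. The snippet below is a sketch in spark-shell style, adapted from the quick-start guide linked at the bottom of this page; the field names, table name, and path are illustrative, not prescribed by Hudi.

```scala
// Inside spark-shell; spark.implicits supplies toDF on local Seqs.
import org.apache.spark.sql.SaveMode
import spark.implicits._

val df = Seq(
  ("id1", "2020-10-07", "nyc", 19.10),
  ("id2", "2020-10-07", "sfo", 27.70)
).toDF("uuid", "ts", "city", "fare")

df.write.format("hudi").
  option("hoodie.datasource.write.recordkey.field", "uuid").     // record key
  option("hoodie.datasource.write.precombine.field", "ts").      // dedupe ordering
  option("hoodie.datasource.write.partitionpath.field", "city"). // partition column
  option("hoodie.table.name", "hudi_trips").                     // placeholder name
  mode(SaveMode.Overwrite).
  save("/tmp/hudi_trips")
```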
To build the Javadoc for all Java and Scala classes:
```
# Javadoc generated under target/site/apidocs
mvn clean javadoc:aggregate -Pjavadocs
```
The default Scala version supported is 2.11. To build for Scala 2.12, use the scala-2.12 profile:

```
mvn clean package -DskipTests -Dscala-2.12
```
The default Hudi jar bundles the spark-avro module. To build without the spark-avro module, use the spark-shade-unbundle-avro profile:

```
# Checkout code and build
git clone https://github.com/apache/hudi.git && cd hudi
mvn clean package -DskipTests -Pspark-shade-unbundle-avro

# Start command
spark-2.4.4-bin-hadoop2.7/bin/spark-shell \
  --packages org.apache.spark:spark-avro_2.11:2.4.4 \
  --jars `ls packaging/hudi-spark-bundle/target/hudi-spark-bundle_2.11-*.*.*-SNAPSHOT.jar` \
  --conf 'spark.serializer=org.apache.spark.serializer.KryoSerializer'
```
Unit tests can be run with the maven profile unit-tests.

```
mvn -Punit-tests test
```
Functional tests, which are tagged with @Tag("functional"), can be run with the maven profile functional-tests.

```
mvn -Pfunctional-tests test
```
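For reference, a test opts into this profile by carrying the JUnit 5 tag. The class below is a hypothetical example, not part of the Hudi codebase.

```scala
import org.junit.jupiter.api.{Assertions, Tag, Test}

// Hypothetical example showing how a test is tagged so that the
// functional-tests profile picks it up.
@Tag("functional")
class ExampleFunctionalTest {

  @Test
  def insertsRoundTrip(): Unit = {
    // A real functional test would start Spark and write/read a table;
    // this assertion is just a stand-in.
    Assertions.assertEquals(2, 1 + 1)
  }
}
```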
To run tests with Spark event logging enabled, define the Spark event log directory. This allows visualizing the test DAG and stages using the Spark History Server UI.

```
mvn -Punit-tests test -DSPARK_EVLOG_DIR=/path/for/spark/event/log
```
Please visit https://hudi.apache.org/docs/quick-start-guide.html to quickly explore Hudi's capabilities using spark-shell.