commit | 8d0426826472f5b861b6bd7c43d8935917431467 |
---|---
author | Udit Mehrotra <uditme@amazon.com> | Wed Aug 12 17:51:31 2020 -0700
committer | GitHub <noreply@github.com> | Wed Aug 12 17:51:31 2020 -0700
tree | 15fa476d9dc34a38ff444cdf1918cd3ab1ead163 |
parent | 8b928e9bca5ecac8f956b17e5d18f2deaf526253 |
[HUDI-1174] Changes for bootstrapped tables to work with Presto (#1944)

The purpose of this pull request is to implement the changes required on the Hudi side to get bootstrapped tables integrated with Presto. Testing was done against Presto 0.232, and the following changes were identified to make it work:

* The annotation `UseRecordReaderFromInputFormat` is required on `HoodieParquetInputFormat` as well, because reads of bootstrapped tables need to go through the record reader to be able to perform the merge. On the Presto side, this annotation is already handled.
* We need to internally maintain `VIRTUAL_COLUMN_NAMES`, because Presto's internal Hive version (hive-apache-1.2.2) has `VirtualColumn` as a class, whereas the one Hudi depends on is an enum.
* Dependency changes in `hudi-presto-bundle` to avoid runtime exceptions.
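On the second point, here is a minimal sketch of what maintaining virtual column names internally can look like. This is an illustration under assumptions, not the code this PR adds; the column names are standard Hive virtual columns:

```scala
// Illustrative sketch only (not Hudi's actual implementation): tracking
// Hive's virtual column names as plain strings sidesteps the VirtualColumn
// type, which is an enum in the Hive version Hudi builds against but a
// class in Presto's hive-apache-1.2.2.
object VirtualColumnNames {
  val VIRTUAL_COLUMN_NAMES: Set[String] = Set(
    "INPUT__FILE__NAME",
    "BLOCK__OFFSET__INSIDE__FILE",
    "ROW__OFFSET__INSIDE__BLOCK",
    "RAW__DATA__SIZE",
    "GROUPING__ID"
  )

  // Drop virtual columns from a projection before handing it to the reader.
  def stripVirtualColumns(cols: Seq[String]): Seq[String] =
    cols.filterNot(VIRTUAL_COLUMN_NAMES.contains)
}
```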
Apache Hudi (pronounced Hoodie) stands for Hadoop Upserts Deletes and Incrementals. Hudi manages the storage of large analytical datasets on DFS (cloud stores, HDFS, or any Hadoop FileSystem compatible storage).
Hudi supports three types of queries:

* **Snapshot Queries** - Query the latest snapshot of the table, merging columnar base files and row-based log files on the fly.
* **Incremental Queries** - Query only the records written after a given commit/instant time, as a change stream.
* **Read Optimized Queries** - Query only the columnar base files, trading data freshness for query performance.
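As a sketch of how these query types map to Spark datasource reads (run from a spark-shell with the Hudi bundle, as shown in the build steps below; the table path and instant time are illustrative assumptions):

```scala
// Snapshot query: read the latest view of the table.
val snapshotDF = spark.read
  .format("hudi")
  .load("/tmp/hudi_trips/*/*")  // hypothetical table path (glob over partitions)

// Incremental query: only records committed after the given instant time.
val incrementalDF = spark.read
  .format("hudi")
  .option("hoodie.datasource.query.type", "incremental")
  .option("hoodie.datasource.read.begin.instanttime", "20200101000000")  // hypothetical instant
  .load("/tmp/hudi_trips")
```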
Learn more about Hudi at https://hudi.apache.org
Prerequisites for building Apache Hudi:

* Unix-like system (like Linux, Mac OS X)
* Java 8
* Git
* Maven
```bash
# Checkout code and build
git clone https://github.com/apache/hudi.git && cd hudi
mvn clean package -DskipTests -DskipITs

# Start command
spark-2.4.4-bin-hadoop2.7/bin/spark-shell \
  --jars `ls packaging/hudi-spark-bundle/target/hudi-spark-bundle_2.11-*.*.*-SNAPSHOT.jar` \
  --conf 'spark.serializer=org.apache.spark.serializer.KryoSerializer'
```
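From that shell, a first table can be written and read back through the Hudi datasource. A minimal sketch, assuming illustrative table, path, and field names (`spark` is predefined in spark-shell):

```scala
import org.apache.spark.sql.SaveMode
import spark.implicits._

// Toy dataframe with a record key ("id"), a partition field ("region"),
// and an ordering field ("ts") used for de-duplication.
val df = Seq((1, "us", 100L), (2, "eu", 101L)).toDF("id", "region", "ts")

df.write.format("hudi")
  .option("hoodie.table.name", "hudi_demo")                        // illustrative table name
  .option("hoodie.datasource.write.recordkey.field", "id")         // record key field
  .option("hoodie.datasource.write.partitionpath.field", "region") // partition path field
  .option("hoodie.datasource.write.precombine.field", "ts")        // pick latest record per key
  .mode(SaveMode.Overwrite)
  .save("/tmp/hudi_demo")                                          // illustrative base path

// Read it back as a snapshot query.
spark.read.format("hudi").load("/tmp/hudi_demo/*").show()
```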
To build the Javadoc for all Java and Scala classes:
```bash
# Javadoc generated under target/site/apidocs
mvn clean javadoc:aggregate -Pjavadocs
```
The default Scala version supported is 2.11. To build for Scala 2.12, build using the `scala-2.12` profile:

```bash
mvn clean package -DskipTests -DskipITs -Dscala-2.12
```
The default hudi jar bundles the spark-avro module. To build without the spark-avro module, build using the `spark-shade-unbundle-avro` profile:

```bash
# Checkout code and build
git clone https://github.com/apache/hudi.git && cd hudi
mvn clean package -DskipTests -DskipITs -Pspark-shade-unbundle-avro

# Start command
spark-2.4.4-bin-hadoop2.7/bin/spark-shell \
  --packages org.apache.spark:spark-avro_2.11:2.4.4 \
  --jars `ls packaging/hudi-spark-bundle/target/hudi-spark-bundle_2.11-*.*.*-SNAPSHOT.jar` \
  --conf 'spark.serializer=org.apache.spark.serializer.KryoSerializer'
```
All tests can be run with Maven:

```bash
mvn test
```
To run tests with Spark event logging enabled, define the Spark event log directory. This allows visualizing the test DAG and stages using the Spark History Server UI.

```bash
mvn test -DSPARK_EVLOG_DIR=/path/for/spark/event/log
```
Please visit https://hudi.apache.org/docs/quick-start-guide.html to quickly explore Hudi's capabilities using spark-shell.