hudi-examples/hudi-examples-spark/src/test/python/README.md

Requirements

Python is required to run this example. PySpark 2.4.7 does not work with recent versions of Python (3.8+), so if you want to use a later Spark version (3.3 in the example below), build Hudi with the following command:

cd $HUDI_DIR
mvn clean install -DskipTests -Dspark3.3 -Dscala2.12 

Various Python packages may also need to be installed. Install pip, then run pip install <package name> for each package that is missing.
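To find out which packages still need installing, a small check like the one below may help. It is only a sketch: the package list is an assumption (only pyspark is named in this README), and the helper name missing_packages is hypothetical.

```python
import importlib.util
import sys


def missing_packages(names):
    """Return the top-level package names that cannot be imported here."""
    # find_spec returns None when a top-level module is not installed.
    return [name for name in names if importlib.util.find_spec(name) is None]


# Assumed list: extend it with whatever the quickstart reports as missing.
required = ["pyspark"]
missing = missing_packages(required)
if missing:
    print("Missing packages: " + ", ".join(missing))
    print("Try: " + sys.executable + " -m pip install " + " ".join(missing))
else:
    print("All listed packages are importable.")
```

Running pip through sys.executable installs into the same interpreter that will run the quickstart, which avoids a mismatch when several Python versions are installed.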

How to Run

  1. Download pyspark
  2. Extract it where you want it to be installed and note that location
  3. Run (or add to .bashrc) the following, making sure to set the correct path for SPARK_HOME:
export SPARK_HOME=/path/to/spark/home
export PATH=$PATH:$SPARK_HOME/bin:$SPARK_HOME/sbin
export PYTHONPATH=$SPARK_HOME/python/:$PYTHONPATH
export PYTHONPATH=$(ZIPS=("$SPARK_HOME"/python/lib/*.zip); IFS=:; echo "${ZIPS[*]}"):$PYTHONPATH
  4. Identify the Hudi Spark Bundle .jar or package that you wish to use:
     A package will be in the format org.apache.hudi:hudi-spark3.3-bundle_2.12:0.12.0
     A jar will be in the format [HUDI_BASE_PATH]/packaging/hudi-spark-bundle/target/hudi-spark-bundle[VERSION].jar
  5. Go to the hudi directory and run the quickstart examples using the commands below, using the -t flag for the table name and the -p flag or -j flag for your package or jar respectively.
cd $HUDI_DIR
python3 hudi-examples/hudi-examples-spark/src/test/python/HoodiePySparkQuickstart.py [-h] -t TABLE (-p PACKAGE | -j JAR)
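The usage line above reads like argparse output with a required, mutually exclusive -p/-j group. The sketch below shows one way such a command line could be parsed and mapped to a Spark setting. It is not taken from HoodiePySparkQuickstart.py: the long option names (--table, --package, --jar) and the helper names build_parser and spark_bundle_conf are hypothetical. spark.jars.packages and spark.jars are standard Spark configuration keys.

```python
import argparse


def build_parser():
    """Build a parser matching: [-h] -t TABLE (-p PACKAGE | -j JAR)."""
    parser = argparse.ArgumentParser(prog="HoodiePySparkQuickstart.py")
    # Long option names are assumptions; the README only shows the short flags.
    parser.add_argument("-t", "--table", required=True, help="Hudi table name")
    # Exactly one of -p / -j must be supplied.
    source = parser.add_mutually_exclusive_group(required=True)
    source.add_argument("-p", "--package", help="Maven coordinates of the bundle")
    source.add_argument("-j", "--jar", help="path to a locally built bundle jar")
    return parser


def spark_bundle_conf(args):
    """Return the Spark setting that would load the chosen bundle."""
    if args.package is not None:
        # Spark resolves Maven coordinates listed under spark.jars.packages.
        return {"spark.jars.packages": args.package}
    # A local jar file is passed through spark.jars instead.
    return {"spark.jars": args.jar}


# Example with an explicit argument list instead of sys.argv.
example = build_parser().parse_args(
    ["-t", "hudi_table", "-p", "org.apache.hudi:hudi-spark3.3-bundle_2.12:0.12.0"]
)
print(example.table, spark_bundle_conf(example))
```

With a parser of this shape, passing both -p and -j, or neither, makes argparse exit with a usage error, which matches the (-p PACKAGE | -j JAR) notation.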