Some common questions about developing, debugging, and testing are asked again and again. To help developers start contributing to Gluten as quickly as possible, we have collected these frequently asked questions and organized them in Q&A form, so they are easy to check and learn from.
When you encounter and then resolve a new problem, please add an item to this document if you think it may be helpful to other developers.
We use ${GLUTEN_HOME} to represent the home directory of Gluten in this document.
Gluten works as a bridge: it is a middle layer between Spark and the native execution library. Gluten is responsible for validating whether the operators of a Spark plan can be executed by the native engine. If they can, Gluten transforms the Spark plan into a Substrait plan and then sends the Substrait plan to the native engine.
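For context, here is a minimal sketch of enabling Gluten in a Spark session. The config keys follow the Gluten getting-started docs, but the exact values (the offheap size and the bundle jar path) depend on your build and version:

```bash
spark-shell \
  --conf spark.plugins=org.apache.gluten.GlutenPlugin \
  --conf spark.memory.offHeap.enabled=true \
  --conf spark.memory.offHeap.size=4g \
  --conf spark.shuffle.manager=org.apache.spark.shuffle.sort.ColumnarShuffleManager \
  --jars /path/to/gluten-velox-bundle.jar
```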
The Gluten code consists of two parts: the C++ code and the Java/Scala code. The C++ code is located under ${GLUTEN_HOME}/cpp, and the Java/Scala code is located elsewhere in the repository. JNI is the technology used to invoke C++ from Java; all JNI interfaces are defined in the file JniWrapper.cc under the jni directory.
If you are not concerned with the Scala/Java code and just want to debug the C++ code executed in the native engine, you can debug the C++ code via benchmarks with GDB.
To debug the C++ code, you first have to generate the example files: a Substrait plan in JSON format, a config file, and the Parquet files containing the plan's input data.
You can generate the example files by the following steps:
```bash
${GLUTEN_HOME}/dev/builddeps-veloxbe.sh --build_tests=ON --build_benchmarks=ON --build_examples=ON --build_type=Debug
```
--build_type=Debug is good for debugging. The executable file generic_benchmark will be generated under the directory ${GLUTEN_HOME}/cpp/build/velox/benchmarks/. Then generate the example files:

```bash
cd ${GLUTEN_HOME}
mvn test -Pspark-3.2 -Pbackends-velox -pl backends-velox \
  -am -DtagsToInclude="org.apache.gluten.tags.GenerateExample" \
  -Dtest=none -DfailIfNoTests=false \
  -Dexec.skip
```
The example files will be generated under ${GLUTEN_HOME}/backends-velox/generated-native-benchmark/. Replace -Pspark-3.2 with -Pspark-3.3 if your Spark version is 3.3.

```bash
$ tree ${GLUTEN_HOME}/backends-velox/generated-native-benchmark/
/some-dir-to-gluten-home/backends-velox/generated-native-benchmark/
|-- conf_12_0.ini
|-- data_12_0_0.parquet
|-- data_12_0_1.parquet
`-- plan_12_0.json
```
Launch GDB with generic_benchmark:

```bash
cd ${GLUTEN_HOME}
gdb cpp/build/velox/benchmarks/generic_benchmark
```
After GDB loads generic_benchmark successfully, you can set a breakpoint on the main function with the command b main, and then run with the r command, passing the example files as arguments:

```bash
r --with-shuffle --partitioning hash --threads 1 --iterations 1 \
  --conf backends-velox/generated-native-benchmark/conf_12_0.ini \
  --plan backends-velox/generated-native-benchmark/plan_12_0.json \
  --data backends-velox/generated-native-benchmark/data_12_0_0.parquet,backends-velox/generated-native-benchmark/data_12_0_1.parquet
```

The process generic_benchmark will start and stop at the main function. You can print a variable's value with p variable_name, execute the program line by line with n, or step into a called function with s. In fact, you can debug generic_benchmark with any GDB command, just as when debugging a normal C++ program, because generic_benchmark is a pure C++ executable. gdb-tui is a valuable feature and worth trying; you can get more help from the online gdb-tui docs.
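For instance, a typical session might look like this (some_variable is a hypothetical name; substitute one visible in the frame where you stopped):

```
(gdb) b main           # stop at the benchmark entry point
(gdb) r <args>         # run with the example-file arguments shown above
(gdb) n                # execute the next line
(gdb) s                # step into the called function
(gdb) p some_variable  # print a variable's value
(gdb) bt               # print the current call stack
```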
You can also start generic_benchmark with a specific JSON plan and input files. Edit plan_12_0.json to customize the Substrait plan, or specify input files placed in another directory.
Gluten validates a generated plan before executing it, and validation usually happens on the native side, so we provide a utility to help debug the validation process on the native side.
Set spark.gluten.sql.debug=true, and you will find the generated plan printed to stderr in JSON format; save it as plan.json, for example. Build with --build_benchmarks=ON, then find the plan_validator_util executable under ${GLUTEN_HOME}/cpp/build/velox/benchmarks/ and run it against the saved plan:

```bash
./plan_validator_util <path>/plan.json
```
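As a sketch of the whole flow (the file names and paths are arbitrary examples):

```bash
# 1. Run a query with debug output enabled and capture stderr.
spark-shell --conf spark.gluten.sql.debug=true 2> stderr.log
# 2. Extract the JSON plan from stderr.log, save it as plan.json,
#    then validate it with the native utility built above.
cd ${GLUTEN_HOME}/cpp/build/velox/benchmarks
./plan_validator_util /path/to/plan.json
```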
To debug runtime issues in Scala/Java, we recommend using IntelliJ remote debugging; see the tutorial link. According to your IntelliJ remote debug settings, set SPARK_SUBMIT_OPTS in the environment where spark-submit is executed. See the example below.
```bash
export SPARK_SUBMIT_OPTS=-agentlib:jdwp=transport=dt_socket,server=y,suspend=y,address=8008
```
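For example, a minimal remote-debug launch could look like the following (the port must match your IDE's run configuration; the application class and jar are hypothetical):

```bash
export SPARK_SUBMIT_OPTS="-agentlib:jdwp=transport=dt_socket,server=y,suspend=y,address=8008"
# With suspend=y, the JVM blocks on port 8008 until the IDE debugger attaches.
spark-submit --master local[2] --class org.example.YourApp your-app.jar
```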
To run a Scala/Java test class, you can use the below mvn command (taking the Velox backend as an example), which is helpful for debugging unit test failures reported by Gluten CI.
```bash
mvn test -Pspark-3.5 -Pspark-ut -Pbackends-velox -DargLine="-Dspark.test.home=/path/to/spark/source/code/home/" -DwildcardSuites=xxx
```
Please set wildcardSuites to a fully qualified class name. Setting spark.test.home is optional; it is only required by some test suites that use Spark resources. If you do specify the spark.test.home arg, you can use the install-spark-resources.sh script to get a directory with the necessary resource files:

```bash
# Define a directory to use for the Spark files and the Spark version
export spark_dir=/tmp/spark
export spark_version=3.5

# Run the install-spark-resources.sh script
.github/workflows/util/install-spark-resources.sh ${spark_version} ${spark_dir}
```
After running install-spark-resources.sh, define the spark.test.home directory like -DargLine="-Dspark.test.home=${spark_dir}/shims/spark35/spark_home" when running unit tests. In most cases, please make sure the Gluten native build is done before running a Scala/Java test.
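Putting it together, a full invocation might look like the following (the suite name is only an example; substitute the failing suite reported by CI):

```bash
mvn test -Pspark-3.5 -Pspark-ut -Pbackends-velox \
  -DargLine="-Dspark.test.home=${spark_dir}/shims/spark35/spark_home" \
  -DwildcardSuites=org.apache.gluten.execution.VeloxTPCHSuite
```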
To debug a core dump produced by a crashed process, change to the directory where the core file was generated and load the core file together with the Gluten library in GDB:

```bash
cd the_directory_of_core_file_generated
gdb ${GLUTEN_HOME}/cpp/build/releases/libgluten.so 'core-Executor task l-2000883-1671542526'
```

Here, core-Executor task l-2000883-1671542526 represents the core file name.
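If no core file appears after a crash, core dumps may be disabled on the node. A quick check, assuming a Linux host:

```bash
ulimit -c unlimited                # allow core files of unlimited size in this shell
cat /proc/sys/kernel/core_pattern  # shows where the kernel writes core files
```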
Currently, we have no dedicated memory allocator implemented with jemalloc. You can set the environment variable LD_PRELOAD to the jemalloc library to let it override the corresponding C standard functions entirely. This may help alleviate OOM issues.
```
spark.executorEnv.LD_PRELOAD=/path/to/libjemalloc.so
```
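For example, in a job submission (a sketch; the jemalloc path is an assumption, so use the actual location on your executor nodes, and the application class and jar are hypothetical):

```bash
spark-submit \
  --conf spark.executorEnv.LD_PRELOAD=/usr/lib/x86_64-linux-gnu/libjemalloc.so.2 \
  --class org.example.YourApp your-app.jar
```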
Currently, both the Parquet and DWRF file formats are supported; the related scripts and files are under the directory ${GLUTEN_HOME}/backends-velox/workload/tpch. The README.md under ${GLUTEN_HOME}/backends-velox/workload/tpch offers some useful help, but it is still incomplete and not fully accurate.
One way to run the TPC-H test is to run the velox-be workflow; you can refer to velox_backend.yml.
Here we explain how to run TPC-H on the Velox backend with the Parquet file format.
First, prepare the dataset. You can generate a Parquet dataset with the scripts under ${GLUTEN_HOME}/tools/workload/tpch/gen_data/parquet_dataset (the above-mentioned README.md can help), or use the small dataset under ${GLUTEN_HOME}/backends-velox/src/test/resources/tpch-data-parquet directly; if you just want to do simple TPC-H testing, this dataset is a good choice.
Then modify ${GLUTEN_HOME}/tools/workload/tpch/run_tpch/tpch_parquet.scala. Set var parquet_file_path to the correct directory. If you use the small dataset from the first step directly and ${GLUTEN_HOME} is /home/gluten, modify it as below:

```scala
var parquet_file_path = "/home/gluten/backends-velox/src/test/resources/tpch-data-parquet"
```

Also set var gluten_root to the correct directory. If ${GLUTEN_HOME} is /home/gluten, modify it as below:

```scala
var gluten_root = "/home/gluten"
```
Next, modify ${GLUTEN_HOME}/tools/workload/tpch/run_tpch/tpch-parquet.sh. Set GLUTEN_JAR correctly (please refer to the section Build Gluten with Velox Backend) and set SPARK_HOME correctly. Then execute tpch-parquet.sh with the below commands:

```bash
cd ${GLUTEN_HOME}/tools/workload/tpch/run_tpch/
./tpch-parquet.sh
```

For TPC-DS, please refer to ${GLUTEN_HOME}/tools/workload/tpcds/README.md.
When your Gluten Spark jobs fail because of OOM, you can track the call stacks of memory allocations by setting spark.gluten.memory.backtrace.allocation=true. With this configuration, a BacktraceAllocationListener that wraps the SparkAllocationListener is used to create the VeloxMemoryManager.
BacktraceAllocationListener checks every allocation: if a single allocation exceeds a fixed threshold, or the accumulated allocated bytes cross a 1/2/3...GB boundary, the call stack of the allocation is written to standard output. You can inspect the backtrace to get valuable information for tracking down memory exhaustion issues.
You can also adjust the policy that decides when to emit a backtrace, for example by changing the fixed threshold.
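For example, enabling the backtrace in a job submission (a sketch; the application class and jar are hypothetical):

```bash
spark-submit \
  --conf spark.gluten.memory.backtrace.allocation=true \
  --class org.example.YourApp your-app.jar
```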