 ---
 layout: page
 title: How To Use Gluten
 nav_order: 1
 parent: Developer Overview
 ---
 Some questions about developing, debugging, and testing Gluten come up again and again. To help developers start contributing
 to Gluten as soon as possible, we have collected these frequently asked questions and organized them as a Q&A so that they are
 easy to look up.

 When you run into a new problem and resolve it, please add an item to this document if you think it may help other developers.

 We use `${GLUTEN_HOME}` to represent the home directory of Gluten in this document.

 # How to understand the key work of Gluten?

 Gluten acts as a bridge: it is a middle layer between Spark and the native execution library.

 Gluten is responsible for validating whether the operators of the Spark plan can be executed by the native engine. If they can, Gluten
 transforms the Spark plan into a Substrait plan and sends the Substrait plan to the native engine.

 The Gluten code consists of two parts: the C++ code and the Java/Scala code.
 1. All C++ code is placed under `${GLUTEN_HOME}/cpp`; the Java/Scala code is located elsewhere.
 2. The Java/Scala code is responsible for validating and transforming the execution plan. The source data must also be provided; it may
   come from files or from other sources such as the network.
 3. The C++ code takes the Substrait plan and the source data as inputs and transforms the Substrait plan into the corresponding backend plan. If the backend
   is Velox, the Substrait plan is transformed into a Velox plan and then executed.

 JNI (Java Native Interface) is the mechanism for invoking C++ from Java. All JNI interfaces are defined in the file `JniWrapper.cc` under the directory `jni`.

 # How to debug in Gluten?

 ## 1 How to debug C++
 If you are not concerned with the Scala/Java code and only want to debug the C++ code executed in the native engine, you can debug it through the benchmarks
 with GDB.

 To debug C++, you first have to generate the example files, which consist of:
 - A file containing the Substrait plan in JSON format
 - One or more input data files in Parquet format

 You can generate the example files with the following steps:

 1. Build Velox and Gluten CPP:

 ```
 ${GLUTEN_HOME}/dev/builddeps-veloxbe.sh --build_tests=ON --build_benchmarks=ON --build_examples=ON --build_type=Debug
 ```

 - Compiling with `--build_type=Debug` produces a debug build, which is better suited to debugging.
 - The executable file `generic_benchmark` will be generated under `${GLUTEN_HOME}/cpp/build/velox/benchmarks/`.

 2. Build Gluten and generate the example files:

 ```
 cd ${GLUTEN_HOME}
 mvn test -Pspark-3.2 -Pbackends-velox -pl backends-velox \
 -am -DtagsToInclude="org.apache.gluten.tags.GenerateExample" \
 -Dtest=none -DfailIfNoTests=false \
 -Dexec.skip
 ```

 - After the above steps, the example files are generated under `${GLUTEN_HOME}/backends-velox`.
 - You can check them with the command `tree ${GLUTEN_HOME}/backends-velox/generated-native-benchmark/`.
 - You may replace `-Pspark-3.2` with `-Pspark-3.3` if your Spark version is 3.3.

 ```shell
 $ tree ${GLUTEN_HOME}/backends-velox/generated-native-benchmark/
 /some-dir-to-gluten-home/backends-velox/generated-native-benchmark/
 |-- conf_12_0.ini
 |-- data_12_0_0.parquet
 |-- data_12_0_1.parquet
 `-- plan_12_0.json
 ```

 3. Now, run the benchmark with GDB:

 ```shell
 cd ${GLUTEN_HOME}
 gdb cpp/build/velox/benchmarks/generic_benchmark
 ```

 - Once GDB has loaded `generic_benchmark` successfully, set a breakpoint on the `main` function with the command `b main`, then run the program using the `r` command with
   arguments that point to the example files, for example:
   ```
   r --with-shuffle --partitioning hash --threads 1 --iterations 1 \
     --conf backends-velox/generated-native-benchmark/conf_12_0.ini \
     --plan backends-velox/generated-native-benchmark/plan_12_0.json \
     --data backends-velox/generated-native-benchmark/data_12_0_0.parquet,backends-velox/generated-native-benchmark/data_12_0_1.parquet
   ```
   The `generic_benchmark` process will start and then stop at the `main` function.
 - You can inspect a variable with the command `p variable_name`, execute the program line by line with the command `n`, or step into the function being
   called with the command `s`.
 - In fact, you can debug `generic_benchmark` with any GDB command, just as you would any other C++ program, because `generic_benchmark` is a plain C++
   executable.

 4. `gdb-tui` is a valuable feature that is worth trying. See the online docs for more help:
 [gdb-tui](https://sourceware.org/gdb/onlinedocs/gdb/TUI.html)

 5. You can start `generic_benchmark` with a specific JSON plan and input files.
 - You can also edit the file `plan_12_0.json` to customize the Substrait plan or to point to input files located in another directory.

 6. For more details about the benchmarks, see [MicroBenchmarks](./MicroBenchmarks.md).
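
Typing the long argument list by hand for every debugging session is error-prone. The following is a minimal sketch of a helper that assembles the arguments from a directory of generated example files. The function name `benchmark_args` is hypothetical, and the file-name pattern (`conf_<id>.ini`, `plan_<id>.json`, `data_<id>_<n>.parquet`) is an assumption inferred from the example listing above.

```shell
# Sketch: print the generic_benchmark arguments for one generated example.
# Assumes files are named conf_<id>.ini, plan_<id>.json, data_<id>_<n>.parquet.
benchmark_args() (
  dir=$1   # e.g. backends-velox/generated-native-benchmark
  id=$2    # e.g. 12_0
  data=""
  for f in "$dir"/data_"$id"_*.parquet; do
    [ -e "$f" ] || continue      # skip the unexpanded glob when nothing matches
    data="${data:+$data,}$f"     # join the data files with commas
  done
  # Refuse to print anything if the plan or the data files are missing.
  [ -f "$dir/plan_$id.json" ] && [ -n "$data" ] || {
    echo "missing plan or data files for $id in $dir" >&2
    exit 1
  }
  echo "--with-shuffle --partitioning hash --threads 1 --iterations 1" \
       "--conf $dir/conf_$id.ini --plan $dir/plan_$id.json --data $data"
)
```

For instance, `gdb --args cpp/build/velox/benchmarks/generic_benchmark $(benchmark_args backends-velox/generated-native-benchmark 12_0)` starts GDB with these arguments preset, so a plain `r` after `b main` runs the benchmark with them. The sketch does not handle paths that contain spaces.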

 ## 2 How to debug plan validation process

 Gluten validates the generated plan before executing it. Validation usually happens on the native side, so we provide a utility to help debug the validation process there.

 1. Run the query with the conf `spark.gluten.sql.debug=true`. The generated plan is printed to stderr in JSON format; save it to a file, for example `plan.json`.
 2. Compile the C++ part with `--build_benchmarks=ON`, then look for the `plan_validator_util` executable in `${GLUTEN_HOME}/cpp/build/velox/benchmarks/`.
 3. Run or debug it with `./plan_validator_util <path>/plan.json`.

 ## 3 How to debug Java/Scala

 To debug runtime issues in Scala/Java, we recommend using IntelliJ remote debugging; see the [tutorial](https://www.jetbrains.com/help/idea/tutorial-remote-debug.html).

 Set `SPARK_SUBMIT_OPTS` in the environment where spark-submit is executed, matching your IntelliJ remote debug settings. See the example below.

 ```
 export SPARK_SUBMIT_OPTS=-agentlib:jdwp=transport=dt_socket,server=y,suspend=y,address=8008
 ```
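
If you switch ports or sometimes want the JVM to start without waiting for the debugger, a small helper can build the option string. This is only a sketch: `jdwp_opts` is a hypothetical name, and the defaults simply mirror the example above.

```shell
# Sketch: build the JDWP agent option for SPARK_SUBMIT_OPTS.
jdwp_opts() (
  port=${1:-8008}      # debugger port; 8008 mirrors the example above
  suspend=${2:-y}      # y: the JVM waits until a debugger attaches; n: it starts immediately
  echo "-agentlib:jdwp=transport=dt_socket,server=y,suspend=${suspend},address=${port}"
)

# Export the option so that spark-submit picks it up.
export SPARK_SUBMIT_OPTS="$(jdwp_opts 8008 y)"
```

With `suspend=y` the JVM blocks at startup until the IDE attaches to the port, which is useful when the code of interest runs early.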

 To run a Scala/Java test class, you can use the mvn command below (taking the Velox backend as an example). This is helpful for debugging unit test failures reported by Gluten CI.
 ```
 mvn test -Pspark-3.5 -Pspark-ut -Pbackends-velox -DargLine="-Dspark.test.home=/path/to/spark/source/code/home/" -DwildcardSuites=xxx
 ```

 Set `wildcardSuites` to a fully qualified class name. Setting `spark.test.home` is optional; it is only required for test suites that use Spark resources.
 If you specify the `spark.test.home` arg, set it up in one of two ways:
 * Point it to a directory containing Spark source code that has already been built.
 * Or use the `install-spark-resources.sh` script to get a directory with the necessary resource files:
   ```
   # Define a directory to use for the Spark files and the Spark version
   export spark_dir=/tmp/spark
   export spark_version=3.5

   # Run the install-spark-resources.sh script
   .github/workflows/util/install-spark-resources.sh ${spark_version} ${spark_dir}
   ```
   After running `install-spark-resources.sh`, set `spark.test.home` when running unit tests, like:
   `-DargLine="-Dspark.test.home=${spark_dir}/shims/spark35/spark_home"`.

 In most cases, make sure the Gluten native build is done before running a Scala/Java test.
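
The `spark.test.home` path above can be derived from the two variables passed to `install-spark-resources.sh`. The sketch below assumes that the shim directory is named `spark` followed by the version without the dot (as in `spark35` for 3.5); only the 3.5 case is confirmed by this document, and `spark_test_home_arg` is a hypothetical helper name.

```shell
# Sketch: derive the -DargLine value from the install directory and Spark version.
# Assumption: the shim directory is "spark" + version digits (3.5 -> spark35).
spark_test_home_arg() (
  spark_dir=$1        # directory passed to install-spark-resources.sh
  spark_version=$2    # e.g. 3.5
  shim="spark$(printf '%s' "$spark_version" | tr -d '.')"
  echo "-DargLine=-Dspark.test.home=${spark_dir}/shims/${shim}/spark_home"
)
```

It can then be passed to Maven as a single quoted argument, for example `mvn test -Pspark-3.5 -Pspark-ut -Pbackends-velox "$(spark_test_home_arg /tmp/spark 3.5)" -DwildcardSuites=xxx`.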

 ## 4 How to debug with core-dump
 This section is still to be completed.

 ```shell
 cd the_directory_of_core_file_generated
 gdb ${GLUTEN_HOME}/cpp/build/releases/libgluten.so 'core-Executor task l-2000883-1671542526'
 ```
 - `core-Executor task l-2000883-1671542526` is the name of the core file.
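
Core file names like the one above contain spaces, so they are easy to mistype or to pass unquoted. The following sketch picks the most recently modified core file in a directory. It assumes that core files start with the prefix `core-`, as in the example; the actual name depends on the system's core pattern, and `newest_core` is a hypothetical helper.

```shell
# Sketch: print the most recently modified core file in a directory.
# Assumption: core file names start with "core-", as in the example above.
newest_core() (
  dir=$1
  newest=""
  for f in "$dir"/core-*; do
    [ -f "$f" ] || continue      # skip the unexpanded glob when nothing matches
    if [ -z "$newest" ] || [ "$f" -nt "$newest" ]; then
      newest=$f                  # keep the file with the latest modification time
    fi
  done
  [ -n "$newest" ] || exit 1     # no core file found
  printf '%s\n' "$newest"
)
```

Quoting the command substitution keeps the spaces intact: `gdb ${GLUTEN_HOME}/cpp/build/releases/libgluten.so "$(newest_core .)"`.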

 # How to use jemalloc for Gluten native engine

 Currently, Gluten has no dedicated memory allocator based on jemalloc. Users can set the environment variable `LD_PRELOAD` to the jemalloc library
 so that it overrides the corresponding C standard library functions entirely. This may help alleviate OOM issues.

 `spark.executorEnv.LD_PRELOAD=/path/to/libjemalloc.so`
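
A misspelled variable name or a wrong library path fails silently, because the executor simply starts without jemalloc. The sketch below prints the `--conf` argument only if the library is readable. Note that `jemalloc_conf` is a hypothetical helper and that the check runs on the local machine only; it assumes the same path exists on every executor node.

```shell
# Sketch: print the spark-submit --conf argument that preloads jemalloc.
# The readability check is local; the path must also exist on the executors.
jemalloc_conf() (
  lib=$1   # e.g. /path/to/libjemalloc.so
  [ -r "$lib" ] || { echo "cannot read $lib" >&2; exit 1; }
  echo "--conf spark.executorEnv.LD_PRELOAD=$lib"
)
```

The output can be spliced into the submit command, for example `spark-submit $(jemalloc_conf /path/to/libjemalloc.so) ...`.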

 # How to run TPC-H on Velox backend

 Both Parquet and DWRF file formats are supported. The related scripts and files are under `${GLUTEN_HOME}/backends-velox/workload/tpch`.
 The `README.md` under `${GLUTEN_HOME}/backends-velox/workload/tpch` offers some useful help, but it is not sufficiently complete or precise.

 One way to run the TPC-H test is through the velox-be workflow; see [velox_backend.yml](https://github.com/apache/gluten/blob/main/.github/workflows/velox_backend.yml#L280).

 Here we explain how to run TPC-H on the Velox backend with the Parquet file format.
 1. First, prepare the datasets. You have two choices.
   - Generate Parquet datasets using the script under `${GLUTEN_HOME}/tools/workload/tpch/gen_data/parquet_dataset`; the above-mentioned
     `README.md` can help.
   - Use the small dataset under `${GLUTEN_HOME}/backends-velox/src/test/resources/tpch-data-parquet` directly. If you just want to do simple
     TPC-H testing, this dataset is a good choice.
 2. Second, run the TPC-H test on the Velox backend.
   - Modify `${GLUTEN_HOME}/tools/workload/tpch/run_tpch/tpch_parquet.scala`.
     - Set `var parquet_file_path` to the correct directory. If you use the small dataset from step one, modify it as below:

     ```scala
     var parquet_file_path = "gluten_home/backends-velox/src/test/resources/tpch-data-parquet"
     ```

     - Set `var gluten_root` to the correct directory. If `${GLUTEN_HOME}` is `/home/gluten`, modify it as below:

     ```scala
     var gluten_root = "/home/gluten"
     ```

   - Modify `${GLUTEN_HOME}/tools/workload/tpch/run_tpch/tpch-parquet.sh`.
     - Set `GLUTEN_JAR` correctly. Please refer to the section [Build Gluten with Velox Backend](../get-started/Velox.md#build-gluten-with-velox-backend).
     - Set `SPARK_HOME` correctly.
     - Set the memory configurations appropriately.
   - Execute `tpch-parquet.sh` using the commands below.
     - `cd ${GLUTEN_HOME}/tools/workload/tpch/run_tpch/`
     - `./tpch-parquet.sh`

 # How to run TPC-DS

 Please refer to `${GLUTEN_HOME}/tools/workload/tpcds/README.md`.

 # How to track the memory exhaust problem

 When your Gluten Spark jobs fail because of OOM, you can track the call stacks of memory allocations by configuring `spark.gluten.memory.backtrace.allocation = true`.
 With this configuration, a `BacktraceAllocationListener` that wraps `SparkAllocationListener` is used to create `VeloxMemoryManager`.

 `BacktraceAllocationListener` checks every allocation. If a single allocation exceeds a fixed size, or the accumulated allocated bytes exceed 1G, 2G, 3G, and so on,
 the call stack of the allocation is written to standard output. You can check the backtrace for clues when tracking down the memory exhaustion issue.

 You can also adjust the policy that decides when to backtrace, such as the fixed value.
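
To make the two triggers concrete, here is a toy model of the policy written as a shell function. It is not the actual `BacktraceAllocationListener` code: the function name, the parameters, and the reading of "G" as 1073741824 bytes are all assumptions made for illustration.

```shell
# Toy model of the backtrace policy described above (not Gluten's implementation).
# Assumption: "1G" means 1073741824 bytes.
GIB=1073741824

# should_backtrace <size> <total_before> <max_single>
# Succeeds (exit status 0) when the allocation would trigger a backtrace.
should_backtrace() (
  size=$1            # bytes requested by this allocation
  total_before=$2    # bytes accumulated before this allocation
  max_single=$3      # hypothetical fixed threshold for a single allocation
  total_after=$((total_before + size))
  # Trigger 1: a single allocation larger than the fixed threshold.
  [ "$size" -gt "$max_single" ] && exit 0
  # Trigger 2: the running total crosses a whole-G boundary (1G, 2G, 3G, ...).
  [ $((total_after / GIB)) -gt $((total_before / GIB)) ] && exit 0
  exit 1
)
```

In this model a small allocation can still trigger a backtrace if it happens to push the running total across a boundary, which is why the printed stacks are hints for further investigation and not necessarily the largest consumers.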