docs/velox-backend-limitations.md - incubator-gluten - Git at Google

 ---
 layout: page
 title: Velox Backend Limitations
 nav_order: 5
 ---
 This document describes the limitations of velox backend by listing some known cases where exception will be thrown, gluten behaves incompatibly with spark, or certain plan's execution
 must fall back to vanilla spark, etc.

 ### Override of Spark classes (For Spark3.2 and Spark3.3)
 Gluten avoids to modify Spark's existing code and use Spark APIs if possible. However, some APIs aren't exposed in Vanilla spark and we have to copy the Spark file and do the hardcode changes. The list of override classes can be found as ignoreClasses in package/pom.xml . If you use customized Spark, you may check if the files are modified in your spark, otherwise your changes will be overrided.

 So you need to ensure preferentially load the Gluten jar to overwrite the jar of vanilla spark. Refer to [How to prioritize loading Gluten jars in Spark](https://github.com/apache/incubator-gluten/blob/main/docs/velox-backend-troubleshooting.md#incompatible-class-error-when-using-native-writer).

 If not officially supported spark3.2/3.3 version is used, NoSuchMethodError can be thrown at runtime. More details see [issue-4514](https://github.com/apache/incubator-gluten/issues/4514).

 ### Fallbacks
 Except the unsupported operators, functions, file formats, data sources listed in , there are some known cases also fall back to Vanilla Spark.

 #### ANSI
 Gluten currently doesn't support ANSI mode. If ANSI is enabled, Spark plan's execution will always fall back to vanilla Spark.

 #### Runtime BloomFilter
 Velox BloomFilter's serialization format is different from Spark's. BloomFilter binary generated by Velox can't be deserialized by vanilla spark. So if `might_contain` falls back, we fall back `bloom_filter_agg` to vanilla spark also.

 #### Case Sensitive mode
 Gluten only supports spark default case-insensitive mode. If case-sensitive mode is enabled, user may get incorrect result.

 #### Regexp functions
 In Velox, regexp functions (`rlike`, `regexp_extract`, etc.) are implemented based on RE2, while in Spark they are based on `java.util.regex`.
 * Lookaround (lookahead/lookbehind) pattern is not supported in RE2.
 * When matching white space with pattern "\\s", RE2 doesn't treat "\v" (or "\x0b") as white space, but `java.util.regex` does.

 There are a few unknown incompatible cases. If user cannot tolerate the incompatibility risk, please enable the below configuration property.
 ```
 spark.gluten.sql.fallbackRegexpExpressions
 ```

 #### FileSource format
 Currently, Gluten only fully supports parquet file format and partially support ORC. If other format is used, scan operator falls back to vanilla spark.

 #### Partitioned Table Scan
 Gluten only support the partitioned table scan when the file path contain the partition info, otherwise will fall back to vanilla spark.

 ### Incompatible behavior
 In certain cases, Gluten result may be different from Vanilla spark.

 #### JSON functions
 Velox only supports double quotes surrounded strings, not single quotes, in JSON data. If single quotes are used, gluten will produce incorrect result.

 Velox doesn't support [*] in path when get_json_object function is called and returns null instead.

 #### Parquet read conf
 Gluten supports `spark.files.ignoreCorruptFiles` with default false, if true, the behavior is same as config false.
 Gluten ignores `spark.sql.parquet.datetimeRebaseModeInRead`, it only returns what write in parquet file. It does not consider the difference between legacy
 hybrid (Julian Gregorian) calendar and Proleptic Gregorian calendar. The result may be different with vanilla spark.

 #### Parquet write conf
 Spark has `spark.sql.parquet.datetimeRebaseModeInWrite` config to decide whether legacy hybrid (Julian + Gregorian) calendar
 or Proleptic Gregorian calendar should be used during parquet writing for dates/timestamps. If the parquet to read is written
 by Spark with this config as true, Velox's TableScan will output different result when reading it back.

 #### Partition write (For Spark3.2 and Spark3.3)

 Gluten only supports static partition writes and does not support dynamic partition writes.

 ```scala
 spark.sql("CREATE TABLE t (c int, d long, e long) STORED AS PARQUET partitioned by (c, d)")
 spark.sql("INSERT OVERWRITE TABLE t partition(c=1, d=2) SELECT 3 as e")
 ```
 Gluten does not support dynamic partition write and bucket write, Exception may be raised if you use. e.g.,

 ```scala
 spark.range(100).selectExpr("id as c1", "id % 7 as p")
   .write
   .format("parquet")
   .partitionBy("p")
   .save(f.getCanonicalPath)
 ```

 #### Partition write (For Spark3.4 and later)

 Gluten supports static partition writes and dynamic partition writes.

 ```scala
 spark.sql("CREATE TABLE t (c int, d long, e long) STORED AS PARQUET partitioned by (c, d)")
 spark.sql("INSERT OVERWRITE TABLE t partition(c=1, d) SELECT 2 as d, 3 as e")
 ```

 Gluten does not support bucket write, and will fall back to vanilla Spark.

 ```scala
 spark.range(100).selectExpr("id as c1", "id % 7 as p")
   .write
   .format("parquet")
   .bucketBy(2, "c1")
   .save(f.getCanonicalPath)
 ```

 #### CTAS write (For Spark3.2 and Spark3.3)

 Gluten does not create table as select. It may raise exception. e.g.,

 ```scala
 spark.range(100).toDF("id")
   .write
   .format("parquet")
   .saveAsTable("velox_ctas")
 ```

 #### CTAS write (For Spark3.4 and later)

 Gluten supports create table as select with parquet file format.

 ```scala
 spark.range(100).toDF("id")
   .write
   .format("parquet")
   .saveAsTable("velox_ctas")
 ```

 #### HiveFileFormat write

 Gluten supports writes of HiveFileFormat when the output file type is of type `parquet` only

 #### NaN support
 Velox does NOT support NaN. So unexpected result can be obtained for a few cases, e.g., comparing a number with NaN.

 #### Configuration

 Parquet write only support three configs, other will not take effect.

 - compression code:
   - sql conf: `spark.sql.parquet.compression.codec`
   - option: `compression.codec`
 - block size
   - sql conf: `spark.gluten.sql.columnar.parquet.write.blockSize`
   - option: `parquet.block.size`
 - block rows
   - sql conf: `spark.gluten.sql.native.parquet.write.blockRows`
   - option: `parquet.block.rows`


 ### Fetal error caused by Spark's columnar reading
 If the user enables Spark's columnar reading, error can occur due to Spark's columnar vector is not compatible with
 Gluten's.

 ### Spill

 `OutOfMemoryException` may still be triggered within current implementation of spill-to-disk feature, when shuffle partitions is set to a large number. When this case happens, please try to reduce the partition number to get rid of the OOM.

 ### Unsupported Data type support in ParquetScan

 - Byte type causes fallback to vanilla spark
 - Timestamp type

   Only reading with INT96 and dictionary encoding is supported. When reading INT64 represented millisecond/microsecond timestamps, or INT96 represented timestamps of other encodings, exceptions can occur.

 - Complex types
   - Parquet scan of nested array with struct or array as element type is not supported in Velox (fallback behavior).
   - Parquet scan of nested map with struct as key type, or array type as value type is not supported in Velox (fallback behavior).

 ### CSV Read
 The header option should be true. And now we only support DatasourceV1, i.e., user should set `spark.sql.sources.useV1SourceList=csv`. User defined read option is not supported, which will make CSV read fall back to vanilla Spark in most case.
 CSV read will also fall back to vanilla Spark and log warning when user specifies schema is different with file schema.

 ### Utilizing Map Type as Hash Keys in ColumnarShuffleExchange
 Spark uses the `spark.sql.legacy.allowHashOnMapType` configuration to support hash map key functions.
 Gluten enables this configuration during the creation of ColumnarShuffleExchange, as shown in the code [link](https://github.com/apache/incubator-gluten/blob/0dacac84d3bf3d2759a5dd7e0735147852d2845d/backends-velox/src/main/scala/org/apache/gluten/backendsapi/velox/VeloxSparkPlanExecApi.scala#L355-L363).
 This method bypasses Spark's unresolved checks and creates projects with the hash(mapType) operator before ColumnarShuffleExchange.
 However, if `spark.sql.legacy.allowHashOnMapType` is disabled in a test environment, projects using the hash(mapType) expression may throw an `Invalid call to dataType on unresolved object` exception during validation, causing them to fallback to vanilla Spark, as referenced in the code [link](https://github.com/apache/spark/blob/de5fa426e23b84fc3c2bddeabcd2e1eda515abd5/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/hash.scala#L291-L296).
  Enabling this configuration allows the project to be offloaded to Velox.
	---
	layout: page
	title: Velox Backend Limitations
	nav_order: 5
	---
	This document describes the limitations of velox backend by listing some known cases where exception will be thrown, gluten behaves incompatibly with spark, or certain plan's execution
	must fall back to vanilla spark, etc.

	### Override of Spark classes (For Spark3.2 and Spark3.3)
	Gluten avoids to modify Spark's existing code and use Spark APIs if possible. However, some APIs aren't exposed in Vanilla spark and we have to copy the Spark file and do the hardcode changes. The list of override classes can be found as ignoreClasses in package/pom.xml . If you use customized Spark, you may check if the files are modified in your spark, otherwise your changes will be overrided.

	So you need to ensure preferentially load the Gluten jar to overwrite the jar of vanilla spark. Refer to [How to prioritize loading Gluten jars in Spark](https://github.com/apache/incubator-gluten/blob/main/docs/velox-backend-troubleshooting.md#incompatible-class-error-when-using-native-writer).

	If not officially supported spark3.2/3.3 version is used, NoSuchMethodError can be thrown at runtime. More details see [issue-4514](https://github.com/apache/incubator-gluten/issues/4514).

	### Fallbacks
	Except the unsupported operators, functions, file formats, data sources listed in , there are some known cases also fall back to Vanilla Spark.

	#### ANSI
	Gluten currently doesn't support ANSI mode. If ANSI is enabled, Spark plan's execution will always fall back to vanilla Spark.

	#### Runtime BloomFilter
	Velox BloomFilter's serialization format is different from Spark's. BloomFilter binary generated by Velox can't be deserialized by vanilla spark. So if `might_contain` falls back, we fall back `bloom_filter_agg` to vanilla spark also.

	#### Case Sensitive mode
	Gluten only supports spark default case-insensitive mode. If case-sensitive mode is enabled, user may get incorrect result.

	#### Regexp functions
	In Velox, regexp functions (`rlike`, `regexp_extract`, etc.) are implemented based on RE2, while in Spark they are based on `java.util.regex`.
	* Lookaround (lookahead/lookbehind) pattern is not supported in RE2.
	* When matching white space with pattern "\\s", RE2 doesn't treat "\v" (or "\x0b") as white space, but `java.util.regex` does.

	There are a few unknown incompatible cases. If user cannot tolerate the incompatibility risk, please enable the below configuration property.
	```
	spark.gluten.sql.fallbackRegexpExpressions
	```

	#### FileSource format
	Currently, Gluten only fully supports parquet file format and partially support ORC. If other format is used, scan operator falls back to vanilla spark.

	#### Partitioned Table Scan
	Gluten only support the partitioned table scan when the file path contain the partition info, otherwise will fall back to vanilla spark.

	### Incompatible behavior
	In certain cases, Gluten result may be different from Vanilla spark.

	#### JSON functions
	Velox only supports double quotes surrounded strings, not single quotes, in JSON data. If single quotes are used, gluten will produce incorrect result.

	Velox doesn't support [*] in path when get_json_object function is called and returns null instead.

	#### Parquet read conf
	Gluten supports `spark.files.ignoreCorruptFiles` with default false, if true, the behavior is same as config false.
	Gluten ignores `spark.sql.parquet.datetimeRebaseModeInRead`, it only returns what write in parquet file. It does not consider the difference between legacy
	hybrid (Julian Gregorian) calendar and Proleptic Gregorian calendar. The result may be different with vanilla spark.

	#### Parquet write conf
	Spark has `spark.sql.parquet.datetimeRebaseModeInWrite` config to decide whether legacy hybrid (Julian + Gregorian) calendar
	or Proleptic Gregorian calendar should be used during parquet writing for dates/timestamps. If the parquet to read is written
	by Spark with this config as true, Velox's TableScan will output different result when reading it back.

	#### Partition write (For Spark3.2 and Spark3.3)

	Gluten only supports static partition writes and does not support dynamic partition writes.

	```scala
	spark.sql("CREATE TABLE t (c int, d long, e long) STORED AS PARQUET partitioned by (c, d)")
	spark.sql("INSERT OVERWRITE TABLE t partition(c=1, d=2) SELECT 3 as e")
	```
	Gluten does not support dynamic partition write and bucket write, Exception may be raised if you use. e.g.,

	```scala
	spark.range(100).selectExpr("id as c1", "id % 7 as p")
	.write
	.format("parquet")
	.partitionBy("p")
	.save(f.getCanonicalPath)
	```

	#### Partition write (For Spark3.4 and later)

	Gluten supports static partition writes and dynamic partition writes.

	```scala
	spark.sql("CREATE TABLE t (c int, d long, e long) STORED AS PARQUET partitioned by (c, d)")
	spark.sql("INSERT OVERWRITE TABLE t partition(c=1, d) SELECT 2 as d, 3 as e")
	```

	Gluten does not support bucket write, and will fall back to vanilla Spark.

	```scala
	spark.range(100).selectExpr("id as c1", "id % 7 as p")
	.write
	.format("parquet")
	.bucketBy(2, "c1")
	.save(f.getCanonicalPath)
	```

	#### CTAS write (For Spark3.2 and Spark3.3)

	Gluten does not create table as select. It may raise exception. e.g.,

	```scala
	spark.range(100).toDF("id")
	.write
	.format("parquet")
	.saveAsTable("velox_ctas")
	```

	#### CTAS write (For Spark3.4 and later)

	Gluten supports create table as select with parquet file format.

	```scala
	spark.range(100).toDF("id")
	.write
	.format("parquet")
	.saveAsTable("velox_ctas")
	```

	#### HiveFileFormat write

	Gluten supports writes of HiveFileFormat when the output file type is of type `parquet` only

	#### NaN support
	Velox does NOT support NaN. So unexpected result can be obtained for a few cases, e.g., comparing a number with NaN.

	#### Configuration

	Parquet write only support three configs, other will not take effect.

	- compression code:
	- sql conf: `spark.sql.parquet.compression.codec`
	- option: `compression.codec`
	- block size
	- sql conf: `spark.gluten.sql.columnar.parquet.write.blockSize`
	- option: `parquet.block.size`
	- block rows
	- sql conf: `spark.gluten.sql.native.parquet.write.blockRows`
	- option: `parquet.block.rows`



	### Fetal error caused by Spark's columnar reading
	If the user enables Spark's columnar reading, error can occur due to Spark's columnar vector is not compatible with
	Gluten's.

	### Spill

	`OutOfMemoryException` may still be triggered within current implementation of spill-to-disk feature, when shuffle partitions is set to a large number. When this case happens, please try to reduce the partition number to get rid of the OOM.

	### Unsupported Data type support in ParquetScan

	- Byte type causes fallback to vanilla spark
	- Timestamp type

	Only reading with INT96 and dictionary encoding is supported. When reading INT64 represented millisecond/microsecond timestamps, or INT96 represented timestamps of other encodings, exceptions can occur.

	- Complex types
	- Parquet scan of nested array with struct or array as element type is not supported in Velox (fallback behavior).
	- Parquet scan of nested map with struct as key type, or array type as value type is not supported in Velox (fallback behavior).

	### CSV Read
	The header option should be true. And now we only support DatasourceV1, i.e., user should set `spark.sql.sources.useV1SourceList=csv`. User defined read option is not supported, which will make CSV read fall back to vanilla Spark in most case.
	CSV read will also fall back to vanilla Spark and log warning when user specifies schema is different with file schema.

	### Utilizing Map Type as Hash Keys in ColumnarShuffleExchange
	Spark uses the `spark.sql.legacy.allowHashOnMapType` configuration to support hash map key functions.
	Gluten enables this configuration during the creation of ColumnarShuffleExchange, as shown in the code [link](https://github.com/apache/incubator-gluten/blob/0dacac84d3bf3d2759a5dd7e0735147852d2845d/backends-velox/src/main/scala/org/apache/gluten/backendsapi/velox/VeloxSparkPlanExecApi.scala#L355-L363).
	This method bypasses Spark's unresolved checks and creates projects with the hash(mapType) operator before ColumnarShuffleExchange.
	However, if `spark.sql.legacy.allowHashOnMapType` is disabled in a test environment, projects using the hash(mapType) expression may throw an `Invalid call to dataType on unresolved object` exception during validation, causing them to fallback to vanilla Spark, as referenced in the code [link](https://github.com/apache/spark/blob/de5fa426e23b84fc3c2bddeabcd2e1eda515abd5/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/hash.scala#L291-L296).
	Enabling this configuration allows the project to be offloaded to Velox.