---
layout: page
title: Configuration
nav_order: 15
---

Spark base configurations for Gluten plugin

| Key | Recommended Setting | Description |
|---|---|---|
| spark.plugins | org.apache.gluten.GlutenPlugin | Loads Gluten's components through Spark's plug-in loader. |
| spark.memory.offHeap.enabled | true | Gluten uses off-heap memory for certain operations. |
| spark.memory.offHeap.size | 30G | The absolute amount of memory that can be used for off-heap allocation, in bytes unless otherwise specified. Note: the Gluten plugin uses this setting to allocate memory for native usage even if off-heap is disabled. The right value depends on your system; set it larger if you hit out-of-memory issues in the Gluten plugin. |
| spark.shuffle.manager | org.apache.spark.shuffle.sort.ColumnarShuffleManager | Turns on the Gluten columnar shuffle plugin. |
| spark.driver.extraClassPath | /path/to/gluten_jar_file | Gluten plugin jar file to prepend to the classpath of the driver. |
| spark.executor.extraClassPath | /path/to/gluten_jar_file | Gluten plugin jar file to prepend to the classpath of executors. |
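
As a minimal sketch, the base settings above can be collected in one place and turned into `--conf key=value` arguments for `spark-submit`. The helper name `to_submit_args` and the `BASE_CONF` dictionary are illustrative, not part of Gluten; the jar path and the off-heap size are placeholders to adapt to your environment.

```python
# Base settings taken from the list above. The jar path is a placeholder
# and the off-heap size should be tuned for your system.
GLUTEN_JAR = "/path/to/gluten_jar_file"

BASE_CONF = {
    "spark.plugins": "org.apache.gluten.GlutenPlugin",
    "spark.memory.offHeap.enabled": "true",
    "spark.memory.offHeap.size": "30G",
    "spark.shuffle.manager": "org.apache.spark.shuffle.sort.ColumnarShuffleManager",
    "spark.driver.extraClassPath": GLUTEN_JAR,
    "spark.executor.extraClassPath": GLUTEN_JAR,
}


def to_submit_args(conf):
    """Flatten a settings dict into a list of `--conf key=value` arguments."""
    args = []
    for key, value in conf.items():
        args += ["--conf", f"{key}={value}"]
    return args


if __name__ == "__main__":
    # Print one argument pair per line, ready to paste after `spark-submit`.
    submit_args = to_submit_args(BASE_CONF)
    for flag, setting in zip(submit_args[::2], submit_args[1::2]):
        print(f"{flag} {setting} \\")
```

The same key/value pairs can equally be placed in `spark-defaults.conf`; the dictionary form is only a convenient way to keep them together.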

Gluten configurations

| Key | Default | Description |
|---|---|---|
| spark.gluten.enabled | true | Whether to enable Gluten. This is an experimental property; the recommended way to enable or disable Gluten is through the setting for spark.plugins. |
| spark.gluten.execution.resource.expired.time | 86400 | Expiration time for the cached relation between executions and resources. |
| spark.gluten.expression.blacklist | <undefined> | A blacklist of expressions to skip transforming. Separate multiple values with commas. |
| spark.gluten.loadLibFromJar | false | Whether to load shared libraries from jars. |
| spark.gluten.loadLibOS | <undefined> | The shared library loader's OS name. |
| spark.gluten.loadLibOSVersion | <undefined> | The shared library loader's OS version. |
| spark.gluten.memory.isolation | false | Enable isolated memory mode. If true, Gluten caps the maximum off-heap memory each task can use at X, where X = executor memory / max task slots. Setting this to true is recommended if Gluten serves concurrent queries within a single session, since not all memory Gluten allocates is guaranteed to be spillable; in that case, enable the feature to avoid OOM. |
| spark.gluten.memory.overAcquiredMemoryRatio | 0.3 | If larger than 0, the Velox backend will try to over-acquire this ratio of the total allocated memory as a backup to avoid OOM. |
| spark.gluten.memory.reservationBlockSize | 8MB | Block size used when the native reservation listener reserves memory from Spark. |
| spark.gluten.numTaskSlotsPerExecutor | -1 | A default value must be provided, since non-execution operations (e.g. org.apache.spark.sql.Dataset#summary) don't propagate configurations using org.apache.spark.sql.execution.SQLExecution#withSQLConfPropagated. |
| spark.gluten.ras.costModel | legacy | The class name of the user-defined cost model used by Gluten's transition planner as well as by RAS. If not specified, a legacy built-in cost model is used. The legacy cost model helps the RAS planner exhaustively offload computations, and helps the transition planner choose columnar-to-columnar transitions over others. |
| spark.gluten.ras.enabled | false | Enables RAS (relational algebra selector) during physical planning to generate a more efficient query plan. Note that this feature doesn't bring performance gains by default. Try exploring the option spark.gluten.ras.costModel for advanced usage. |
| spark.gluten.shuffleWriter.bufferSize | <undefined> | |
| spark.gluten.soft-affinity.duplicateReading.maxCacheItems | 10000 | Enable Soft Affinity duplicate reading detection. |
| spark.gluten.soft-affinity.duplicateReadingDetect.enabled | false | If true, enable Soft Affinity duplicate reading detection. |
| spark.gluten.soft-affinity.enabled | false | Whether to enable Soft Affinity scheduling. |
| spark.gluten.soft-affinity.min.target-hosts | 1 | For data on HDFS, if target hosts already exist, prefer scheduling on the original target hosts. |
| spark.gluten.soft-affinity.replications.num | 2 | Number of replications per file used for scheduling to the target executors. |
| spark.gluten.sql.adaptive.costEvaluator.enabled | true | If true, use org.apache.spark.sql.execution.adaptive.GlutenCostEvaluator as the custom cost evaluator class; otherwise follow the configuration spark.sql.adaptive.customCostEvaluatorClass. |
| spark.gluten.sql.ansiFallback.enabled | true | When true (default), Gluten will fall back to Spark when ANSI mode is enabled. When false, Gluten will attempt to execute in ANSI mode. |
| spark.gluten.sql.broadcastNestedLoopJoinTransformerEnabled | true | Config to enable BroadcastNestedLoopJoinExecTransformer. |
| spark.gluten.sql.cacheWholeStageTransformerContext | false | When true, WholeStageTransformer will cache the WholeStageTransformerContext when executing. It is used to get the Substrait plan node and the native plan string. |
| spark.gluten.sql.cartesianProductTransformerEnabled | true | Config to enable CartesianProductExecTransformer. |
| spark.gluten.sql.collapseGetJsonObject.enabled | false | Collapse nested get_json_object functions into one for optimization. |
| spark.gluten.sql.columnar.appendData | true | Enable or disable columnar v2 command append data. |
| spark.gluten.sql.columnar.arrowUdf | true | Enable or disable columnar Arrow UDF. |
| spark.gluten.sql.columnar.batchscan | true | Enable or disable columnar batchscan. |
| spark.gluten.sql.columnar.broadcastExchange | true | Enable or disable columnar broadcastExchange. |
| spark.gluten.sql.columnar.broadcastJoin | true | Enable or disable columnar broadcastJoin. |
| spark.gluten.sql.columnar.cast.avg | true | |
| spark.gluten.sql.columnar.coalesce | true | Enable or disable columnar coalesce. |
| spark.gluten.sql.columnar.collectLimit | true | Enable or disable columnar collectLimit. |
| spark.gluten.sql.columnar.collectTail | true | Enable or disable columnar collectTail. |
| spark.gluten.sql.columnar.enableNestedColumnPruningInHiveTableScan | true | Enable or disable nested column pruning in hivetablescan. |
| spark.gluten.sql.columnar.enableVanillaVectorizedReaders | true | Enable or disable vanilla vectorized scan. |
| spark.gluten.sql.columnar.executor.libpath | | The Gluten executor library path. |
| spark.gluten.sql.columnar.expand | true | Enable or disable columnar expand. |
| spark.gluten.sql.columnar.fallback.expressions.threshold | 50 | Fall back filter/project if the number of nested expressions reaches this threshold, considering that Spark codegen can bring better performance in such cases. |
| spark.gluten.sql.columnar.fallback.ignoreRowToColumnar | true | When true, the fallback policy ignores RowToColumnar when counting the number of fallbacks. |
| spark.gluten.sql.columnar.fallback.preferColumnar | true | When true, the fallback policy prefers the Gluten plan over the vanilla Spark plan if both contain ColumnarToRow and the number of ColumnarToRow nodes in the vanilla Spark plan is not smaller than in the Gluten plan. |
| spark.gluten.sql.columnar.filescan | true | Enable or disable columnar filescan. |
| spark.gluten.sql.columnar.filter | true | Enable or disable columnar filter. |
| spark.gluten.sql.columnar.force.hashagg | true | Whether to force the use of Gluten's hash agg in place of vanilla Spark's sort agg. |
| spark.gluten.sql.columnar.forceShuffledHashJoin | true | |
| spark.gluten.sql.columnar.generate | true | |
| spark.gluten.sql.columnar.hashagg | true | Enable or disable columnar hashagg. |
| spark.gluten.sql.columnar.hivetablescan | true | Enable or disable columnar hivetablescan. |
| spark.gluten.sql.columnar.libname | gluten | The Gluten library name. |
| spark.gluten.sql.columnar.libpath | | The Gluten library path. |
| spark.gluten.sql.columnar.limit | true | |
| spark.gluten.sql.columnar.maxBatchSize | 4096 | |
| spark.gluten.sql.columnar.overwriteByExpression | true | Enable or disable columnar v2 command overwrite by expression. |
| spark.gluten.sql.columnar.parquet.write.blockSize | 128MB | |
| spark.gluten.sql.columnar.partial.project | true | Break up one project node into two phases when some of the expressions are non-offload-able. Phase one is a regular offloaded project transformer that evaluates the offload-able expressions natively; phase two preserves the output from phase one and evaluates the remaining non-offload-able expressions using vanilla Spark projections. |
| spark.gluten.sql.columnar.partial.generate | true | Evaluates the non-offload-able HiveUDTF using the vanilla Spark generator. |
| spark.gluten.sql.columnar.physicalJoinOptimizationLevel | 12 | Fall back to row operators if there are several consecutive joins. |
| spark.gluten.sql.columnar.physicalJoinOptimizeEnable | false | Enable or disable columnar physicalJoinOptimize. |
| spark.gluten.sql.columnar.preferStreamingAggregate | true | The Velox backend supports StreamingAggregate. StreamingAggregate uses less memory, as it does not need to hold all groups in memory, so it can avoid spilling. When true and the child output ordering satisfies the grouping key, Gluten will choose StreamingAggregate as the native operator. |
| spark.gluten.sql.columnar.project | true | Enable or disable columnar project. |
| spark.gluten.sql.columnar.project.collapse | true | Combines two columnar project operators into one and performs alias substitution. |
| spark.gluten.sql.columnar.query.fallback.threshold | -1 | The threshold for whether the query will fall back, determined by counting the number of ColumnarToRow and vanilla leaf nodes. |
| spark.gluten.sql.columnar.range | true | Enable or disable columnar range. |
| spark.gluten.sql.columnar.replaceData | true | Enable or disable columnar v2 command replace data. |
| spark.gluten.sql.columnar.scanOnly | false | When enabled, only the scan and the filter after the scan will be offloaded to native. |
| spark.gluten.sql.columnar.shuffle | true | Enable or disable columnar shuffle. |
| spark.gluten.sql.columnar.shuffle.celeborn.fallback.enabled | true | If enabled, fall back to ColumnarShuffleManager when the Celeborn service is unavailable. Otherwise, throw an exception. |
| spark.gluten.sql.columnar.shuffle.celeborn.useRssSort | true | If true, use the RSS sort implementation for Celeborn sort-based shuffle. If false, use Gluten's row-based sort implementation. Only valid when spark.celeborn.client.spark.shuffle.writer is set to sort. |
| spark.gluten.sql.columnar.shuffle.codec | <undefined> | By default, the supported codecs are lz4 and zstd. When spark.gluten.sql.columnar.shuffle.codecBackend=qat, the supported codecs are gzip and zstd. |
| spark.gluten.sql.columnar.shuffle.codecBackend | <undefined> | |
| spark.gluten.sql.columnar.shuffle.compression.threshold | 100 | If the number of rows in a batch falls below this threshold, all buffers are copied into one buffer for compression. |
| spark.gluten.sql.columnar.shuffle.dictionary.enabled | false | Enable dictionary in hash-based shuffle. |
| spark.gluten.sql.columnar.shuffle.merge.threshold | 0.25 | |
| spark.gluten.sql.columnar.shuffle.readerBufferSize | 1MB | Buffer size in bytes for the shuffle reader reading the input stream from local or remote. |
| spark.gluten.sql.columnar.shuffle.realloc.threshold | 0.25 | |
| spark.gluten.sql.columnar.shuffle.sort.columns.threshold | 100000 | The threshold to determine whether to use sort-based columnar shuffle. Sort-based shuffle will be used if the number of columns is greater than this threshold. |
| spark.gluten.sql.columnar.shuffle.sort.deserializerBufferSize | 1MB | Buffer size in bytes for the sort-based shuffle reader deserializing raw input to columnar batches. |
| spark.gluten.sql.columnar.shuffle.sort.partitions.threshold | 4000 | The threshold to determine whether to use sort-based columnar shuffle. Sort-based shuffle will be used if the number of partitions is greater than this threshold. |
| spark.gluten.sql.columnar.shuffledHashJoin | true | Enable or disable columnar shuffledHashJoin. |
| spark.gluten.sql.columnar.shuffledHashJoin.optimizeBuildSide | true | Whether to allow Gluten to choose an optimal build side for shuffled hash join. |
| spark.gluten.sql.columnar.sort | true | Enable or disable columnar sort. |
| spark.gluten.sql.columnar.sortMergeJoin | true | Enable or disable columnar sortMergeJoin. This should be set with preferSortMergeJoin=false. |
| spark.gluten.sql.columnar.tableCache | false | Enable or disable columnar table cache. |
| spark.gluten.sql.columnar.takeOrderedAndProject | true | |
| spark.gluten.sql.columnar.union | true | Enable or disable columnar union. |
| spark.gluten.sql.columnar.wholeStage.fallback.threshold | -1 | The threshold for whether a whole stage will fall back in the AQE-supported case, determined by counting the number of ColumnarToRow and vanilla leaf nodes. |
| spark.gluten.sql.columnar.window | true | Enable or disable columnar window. |
| spark.gluten.sql.columnar.window.group.limit | true | Enable or disable columnar window group limit. |
| spark.gluten.sql.columnarSampleEnabled | false | Enable or disable columnar sample. |
| spark.gluten.sql.columnarToRowMemoryThreshold | 64MB | |
| spark.gluten.sql.countDistinctWithoutExpand | false | Convert Count Distinct to a UDAF called count_distinct to prevent SparkPlanner from converting it to Expand+Count. WARNING: when enabled, count distinct queries will fail to fall back. |
| spark.gluten.sql.extendedColumnPruning.enabled | true | Do extended nested column pruning for cases ignored by vanilla Spark. |
| spark.gluten.sql.fallbackEncryptedParquet | false | If enabled, Gluten will not offload the scan when encrypted Parquet files are detected. |
| spark.gluten.sql.fallbackEncryptedParquet.limit | 10 | If supplied, limits the number of files checked to determine encryption and fall back to the Java scan. |
| spark.gluten.sql.fallbackRegexpExpressions | false | If true, fall back all regexp expressions. There are a few incompatible cases between RE2 (used by the native engine) and java.util.regex (used by Spark). Users should enable this property if the incompatibility is intolerable. |
| spark.gluten.sql.injectNativePlanStringToExplain | false | When true, Gluten will inject the native plan tree into Spark's explain output. |
| spark.gluten.sql.mergeTwoPhasesAggregate.enabled | true | Whether to merge the two phases of an aggregate if there are no other operators between them. |
| spark.gluten.sql.native.arrow.reader.enabled | false | Specifies whether to enable the native columnar CSV reader. |
| spark.gluten.sql.native.bloomFilter | true | |
| spark.gluten.sql.native.hive.writer.enabled | true | Specifies whether to enable the native columnar writer for HiveFileFormat. Currently only supports HiveFileFormat with Parquet as the output file type. |
| spark.gluten.sql.native.hyperLogLog.Aggregate | true | |
| spark.gluten.sql.native.parquet.write.blockRows | 100000000 | |
| spark.gluten.sql.native.union | false | Enable or disable native union where computation is completely offloaded to the backend. |
| spark.gluten.sql.native.writeColumnMetadataExclusionList | comment | Native file writing does not support column metadata. Metadata in this list is removed to support native file writing. Separate multiple values with commas. |
| spark.gluten.sql.native.writer.enabled | <undefined> | Specifies whether to enable the native columnar Parquet/ORC writer. |
| spark.gluten.sql.orc.charType.scan.fallback.enabled | true | Force fallback for ORC char type scan. |
| spark.gluten.sql.removeNativeWriteFilesSortAndProject | true | When true, Gluten will remove the sort and project added by vanilla Spark V1Writes for the Velox backend. |
| spark.gluten.sql.rewrite.dateTimestampComparison | true | Rewrite the comparison between date and timestamp as a timestamp comparison. For example, from_unixtime(ts) > date will be rewritten to ts > to_unixtime(date). |
| spark.gluten.sql.scan.fileSchemeValidation.enabled | true | When true, enable file path scheme validation for scan. Validation will fail if the file scheme is not supported by the registered file systems, which will cause the scan operator to fall back. |
| spark.gluten.sql.supported.flattenNestedFunctions | and,or | Flatten nested functions into one for optimization. |
| spark.gluten.sql.text.input.empty.as.default | false | Treat empty fields in CSV input as default values. |
| spark.gluten.sql.text.input.max.block.size | 8KB | The max block size for text input rows. |
| spark.gluten.sql.validation.printStackOnFailure | false | |
| spark.gluten.storage.hdfsViewfs.enabled | false | If enabled, Gluten will convert the viewfs path to an hdfs path on the Scala side. |
| spark.gluten.supported.hive.udfs | | Supported Hive UDF names. |
| spark.gluten.supported.python.udfs | | Supported Python UDF names. |
| spark.gluten.supported.scala.udfs | | Supported Scala UDF names. |
| spark.gluten.ui.enabled | true | Whether to enable the Gluten web UI. If true, attach the Gluten UI page to the Spark web UI. |
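
The per-task cap described for spark.gluten.memory.isolation (X = executor memory / max task slots) can be worked through with a small calculation. This is a sketch under stated assumptions, not Gluten's implementation: it assumes "executor memory" refers to the off-heap size configured through spark.memory.offHeap.size, it uses a made-up slot count of 8, and the helpers `parse_size` and `per_task_cap` are hypothetical names. Size suffixes are treated as binary units, as Spark does for size strings.

```python
import re

# Binary multipliers for the size suffixes handled by this sketch.
_UNITS = {"K": 1024, "M": 1024**2, "G": 1024**3, "T": 1024**4}


def parse_size(text):
    """Parse a size such as '30G' or '8MB' into bytes (binary units)."""
    match = re.fullmatch(r"(\d+)([KMGT])B?", text.strip().upper())
    if match is None:
        raise ValueError(f"unrecognized size: {text!r}")
    return int(match.group(1)) * _UNITS[match.group(2)]


def per_task_cap(off_heap_size, task_slots):
    """X = memory / max task slots; memory is assumed to be the off-heap size."""
    if task_slots <= 0:
        raise ValueError("task_slots must be positive")
    return parse_size(off_heap_size) // task_slots


# Example: a 30G off-heap pool shared by 8 task slots (assumed value).
cap = per_task_cap("30G", 8)
print(f"{cap / 1024**3:.2f} GiB per task")  # 3.75 GiB per task
```

With isolation enabled, a larger number of task slots per executor therefore shrinks the budget of each task, which is worth checking before raising executor cores.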

Gluten experimental configurations

| Key | Default | Description |
|---|---|---|
| spark.gluten.auto.adjustStageResource.enabled | false | Experimental: If enabled, Gluten will try to set the stage resources according to the stage execution plan. Only works when AQE is enabled at the same time. |
| spark.gluten.auto.adjustStageResources.fallenNode.ratio.threshold | 0.5 | Experimental: Increase executor heap memory when the ratio of fallen-back nodes to total nodes in a stage exceeds this threshold. |
| spark.gluten.auto.adjustStageResources.heap.ratio | 2.0 | Experimental: Increase executor heap memory when the adjust stage resource rule matches. |
| spark.gluten.auto.adjustStageResources.offheap.ratio | 0.5 | Experimental: Decrease executor off-heap memory when the adjust stage resource rule matches. |
| spark.gluten.memory.dynamic.offHeap.sizing.enabled | false | Experimental: When set to true, the off-heap config (spark.memory.offHeap.size) is ignored; instead, on-heap and off-heap memory are considered in combination, both counting towards the executor memory config (spark.executor.memory). JVM APIs are used to determine how much on-heap memory is in use, alongside tracking of the off-heap allocations made by Gluten. A total memory quota is then enforced, calculated from the sum of the memory that is committed and in use in the Java heap. Since the total quota is calculated when off-heap allocations happen and not when JVM heap memory is allocated, memory can be oversubscribed. Additionally, note that this change is experimental and may have performance implications. |
| spark.gluten.memory.dynamic.offHeap.sizing.memory.fraction | 0.6 | Experimental: Determines the memory fraction used to compute the total memory available for off-heap and on-heap allocations when the dynamic off-heap sizing feature is enabled. The default is set to match spark.executor.memoryFraction. |
| spark.gluten.sql.columnar.cudf | false | Enable or disable cudf support. This is an experimental feature. |
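
To make the three spark.gluten.auto.adjustStageResources.* settings concrete, the following sketch shows one plausible reading of how they could interact. It is an illustration only: the function `adjust_stage_resources` is hypothetical, and the assumption that the two ratios act as multipliers on the current heap and off-heap sizes is not confirmed by the descriptions above.

```python
def adjust_stage_resources(heap_mb, offheap_mb, fallen_nodes, total_nodes,
                           threshold=0.5, heap_ratio=2.0, offheap_ratio=0.5):
    """Return (heap_mb, offheap_mb) for a stage under the assumed rule.

    Assumption: when the share of fallen-back nodes exceeds `threshold`,
    heap is multiplied by `heap_ratio` and off-heap by `offheap_ratio`.
    The defaults mirror the documented default values.
    """
    if total_nodes <= 0:
        raise ValueError("total_nodes must be positive")
    fallen_ratio = fallen_nodes / total_nodes
    if fallen_ratio > threshold:
        return int(heap_mb * heap_ratio), int(offheap_mb * offheap_ratio)
    # Rule not matched: keep the stage resources unchanged.
    return heap_mb, offheap_mb


# A stage where 6 of 10 plan nodes fell back to vanilla Spark.
print(adjust_stage_resources(4096, 8192, fallen_nodes=6, total_nodes=10))
# A stage where only 2 of 10 plan nodes fell back.
print(adjust_stage_resources(4096, 8192, fallen_nodes=2, total_nodes=10))
```

The intuition is that a stage dominated by fallen-back operators runs mostly on the JVM, so it benefits from more heap and needs less native memory.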