---
layout: page
title: Configuration
nav_order: 16
---

Gluten Velox backend configurations

| Key | Default | Description |
|---|---|---|
| spark.gluten.sql.columnar.backend.velox.IOThreads | &lt;undefined&gt; | The size of the IO thread pool in the Connector. This thread pool is used for split preloading and DirectBufferedInput. By default, the value is the same as the maximum task slots per Spark executor. |
| spark.gluten.sql.columnar.backend.velox.SplitPreloadPerDriver | 2 | The number of splits to preload per task. |
| spark.gluten.sql.columnar.backend.velox.abandonPartialAggregationMinPct | 90 | If the partial aggregation's aggregationPct is greater than this value, partial aggregation may be abandoned early. Note: this option only works when flushable partial aggregation is enabled. Ignored when spark.gluten.sql.columnar.backend.velox.flushablePartialAggregation=false. |
| spark.gluten.sql.columnar.backend.velox.abandonPartialAggregationMinRows | 100000 | If the number of partial aggregation input rows is greater than this value, partial aggregation may be abandoned early. Note: this option only works when flushable partial aggregation is enabled. Ignored when spark.gluten.sql.columnar.backend.velox.flushablePartialAggregation=false. |
| spark.gluten.sql.columnar.backend.velox.asyncTimeoutOnTaskStopping | 30000ms | Timeout for asynchronous execution when a task is being stopped in the Velox backend. It is recommended to set this to a value larger than the network connection timeout that any async tasks rely on. |
| spark.gluten.sql.columnar.backend.velox.bloomFilter.expectedNumItems | 1000000 | The default number of expected items for the Velox bloom filter: ‘spark.bloom_filter.expected_num_items’ |
| spark.gluten.sql.columnar.backend.velox.bloomFilter.maxNumBits | 4194304 | The max number of bits to use for the Velox bloom filter: ‘spark.bloom_filter.max_num_bits’ |
| spark.gluten.sql.columnar.backend.velox.bloomFilter.numBits | 8388608 | The default number of bits to use for the Velox bloom filter: ‘spark.bloom_filter.num_bits’ |
| spark.gluten.sql.columnar.backend.velox.cacheEnabled | false | Enable the Velox cache (off by default). It is recommended to enable soft-affinity as well when enabling the Velox cache. |
| spark.gluten.sql.columnar.backend.velox.cachePrefetchMinPct | 0 | Set the minimum prefetch cache percentage for Velox file scan. |
| spark.gluten.sql.columnar.backend.velox.checkUsageLeak | true | Enable checking for memory usage leaks. |
| spark.gluten.sql.columnar.backend.velox.cudf.memoryPercent | 50 | The initial percentage of GPU memory to allocate for the memory resource for one thread. |
| spark.gluten.sql.columnar.backend.velox.cudf.memoryResource | async | GPU RMM memory resource. |
| spark.gluten.sql.columnar.backend.velox.directorySizeGuess | 32KB | Deprecated; renamed to spark.gluten.sql.columnar.backend.velox.footerEstimatedSize. |
| spark.gluten.sql.columnar.backend.velox.fileHandleCacheEnabled | false | Disables caching if false. The file handle cache should be disabled if files are mutable, i.e. file content may change while the file path stays the same. |
| spark.gluten.sql.columnar.backend.velox.filePreloadThreshold | 1MB | Set the file preload threshold for Velox file scan; refer to Velox's file-preload-threshold. |
| spark.gluten.sql.columnar.backend.velox.floatingPointMode | loose | Controls how strictly floating-point operations are aligned with Spark. When the mode is set to strict, flushing is disabled for sum(float/double) and avg(float/double). When set to loose, flushing is enabled. |
| spark.gluten.sql.columnar.backend.velox.flushablePartialAggregation | true | Enable flushable aggregation. If true, Gluten will try converting regular aggregation into Velox's flushable aggregation when applicable. A flushable aggregation can emit intermediate results at any time when memory is full or the data reduction ratio is low. |
| spark.gluten.sql.columnar.backend.velox.footerEstimatedSize | 32KB | Set the estimated footer size for Velox file scan; refer to Velox's footer-estimated-size. |
| spark.gluten.sql.columnar.backend.velox.loadQuantum | 256MB | Set the load quantum for Velox file scan. The default value (256MB) is recommended for performance. If the Velox cache is enabled, it can be 8MB at most. |
| spark.gluten.sql.columnar.backend.velox.maxCoalescedBytes | 64MB | Set the max coalesced bytes for Velox file scan. |
| spark.gluten.sql.columnar.backend.velox.maxCoalescedDistance | 512KB | Set the max coalesced distance in bytes for Velox file scan. |
| spark.gluten.sql.columnar.backend.velox.maxExtendedPartialAggregationMemoryRatio | 0.15 | Set the max extended memory of partial aggregation as maxExtendedPartialAggregationMemoryRatio of the off-heap size. Note: this option only works when flushable partial aggregation is enabled. Ignored when spark.gluten.sql.columnar.backend.velox.flushablePartialAggregation=false. |
| spark.gluten.sql.columnar.backend.velox.maxPartialAggregationMemory | &lt;undefined&gt; | Set the max memory of partial aggregation in bytes. When this option is set to a value greater than 0, it overrides spark.gluten.sql.columnar.backend.velox.maxPartialAggregationMemoryRatio. Note: this option only works when flushable partial aggregation is enabled. Ignored when spark.gluten.sql.columnar.backend.velox.flushablePartialAggregation=false. |
| spark.gluten.sql.columnar.backend.velox.maxPartialAggregationMemoryRatio | 0.1 | Set the max memory of partial aggregation as maxPartialAggregationMemoryRatio of the off-heap size. Note: this option only works when flushable partial aggregation is enabled. Ignored when spark.gluten.sql.columnar.backend.velox.flushablePartialAggregation=false. |
| spark.gluten.sql.columnar.backend.velox.maxPartitionsPerWritersSession | 10000 | Maximum number of partitions per single table writer instance. |
| spark.gluten.sql.columnar.backend.velox.maxSpillBytes | 100G | The maximum total spill file size of a query. |
| spark.gluten.sql.columnar.backend.velox.maxSpillFileSize | 1GB | The maximum size of a single spill file. |
| spark.gluten.sql.columnar.backend.velox.maxSpillLevel | 4 | The max allowed spilling level, with zero being the initial spilling level. |
| spark.gluten.sql.columnar.backend.velox.maxSpillRunRows | 3M | The maximum number of rows in a single spill run. |
| spark.gluten.sql.columnar.backend.velox.memCacheSize | 1GB | The memory cache size. |
| spark.gluten.sql.columnar.backend.velox.memInitCapacity | 8MB | The initial memory capacity to reserve for a newly created Velox query memory pool. |
| spark.gluten.sql.columnar.backend.velox.memoryPoolCapacityTransferAcrossTasks | true | Whether to allow memory capacity transfer between memory pools from different tasks. |
| spark.gluten.sql.columnar.backend.velox.memoryUseHugePages | false | Use explicit huge pages for Velox memory allocation. |
| spark.gluten.sql.columnar.backend.velox.orc.scan.enabled | true | Enable Velox ORC scan. If disabled, vanilla Spark ORC scan will be used. |
| spark.gluten.sql.columnar.backend.velox.prefetchRowGroups | 1 | Set the number of row groups to prefetch for Velox file scan. |
| spark.gluten.sql.columnar.backend.velox.propagateIgnoreNullKeys | true | If enabled, Gluten identifies aggregations followed by an inner join on the grouping keys and sets the ignoreNullKeys flag to true to avoid unnecessary aggregation on null keys. |
| spark.gluten.sql.columnar.backend.velox.queryTraceEnabled | false | Enable the query tracing flag. |
| spark.gluten.sql.columnar.backend.velox.reclaimMaxWaitMs | 3600000ms | The max time in ms to wait for memory reclaim. |
| spark.gluten.sql.columnar.backend.velox.resizeBatches.shuffleInput | true | If true, combine small columnar batches together before sending to shuffle. The default minimum output batch size is equal to 0.25 * spark.gluten.sql.columnar.maxBatchSize |
| spark.gluten.sql.columnar.backend.velox.resizeBatches.shuffleInput.minSize | &lt;undefined&gt; | The minimum batch size for shuffle. If the size of an input batch is smaller than this value, it will be combined with other batches before sending to shuffle. Only takes effect when spark.gluten.sql.columnar.backend.velox.resizeBatches.shuffleInput is set to true. Default value: 0.25 * |
| spark.gluten.sql.columnar.backend.velox.resizeBatches.shuffleInputOuptut.minSize | &lt;undefined&gt; | The minimum batch size for shuffle input and output. If the size of an input batch is smaller than this value, it will be combined with other batches before sending to shuffle. The same applies to batches output by shuffle read. Only takes effect when spark.gluten.sql.columnar.backend.velox.resizeBatches.shuffleInput or spark.gluten.sql.columnar.backend.velox.resizeBatches.shuffleOutput is set to true. Default value: 0.25 * |
| spark.gluten.sql.columnar.backend.velox.resizeBatches.shuffleOutput | false | If true, combine small columnar batches together right after shuffle read. The default minimum output batch size is equal to 0.25 * spark.gluten.sql.columnar.maxBatchSize |
| spark.gluten.sql.columnar.backend.velox.showTaskMetricsWhenFinished | false | Show full Velox task metrics when a task finishes. |
| spark.gluten.sql.columnar.backend.velox.spillFileSystem | local | The filesystem used to store spill data. local: the local file system. heap-over-local: write files to the JVM heap if there is extra heap space; otherwise write to the local file system. |
| spark.gluten.sql.columnar.backend.velox.spillStrategy | auto | none: disable spill on the Velox backend; auto: let the Spark memory manager manage Velox's spilling. |
| spark.gluten.sql.columnar.backend.velox.ssdCacheIOThreads | 1 | The number of IO threads for cache promotion. |
| spark.gluten.sql.columnar.backend.velox.ssdCachePath | /tmp | The folder in which to store the cache files, preferably on an SSD. |
| spark.gluten.sql.columnar.backend.velox.ssdCacheShards | 1 | The number of cache shards. |
| spark.gluten.sql.columnar.backend.velox.ssdCacheSize | 1GB | The SSD cache size. Only memory caching is used if this value is 0. |
| spark.gluten.sql.columnar.backend.velox.ssdCheckpointIntervalBytes | 0 | Checkpoint after every ‘checkpointIntervalBytes’ for SSD cache. 0 means no checkpointing. |
| spark.gluten.sql.columnar.backend.velox.ssdChecksumEnabled | false | If true, checksum write to SSD is enabled. |
| spark.gluten.sql.columnar.backend.velox.ssdChecksumReadVerificationEnabled | false | If true, checksum read verification from SSD is enabled. |
| spark.gluten.sql.columnar.backend.velox.ssdDisableFileCow | false | True if copy on write should be disabled. |
| spark.gluten.sql.columnar.backend.velox.ssdODirect | false | The O_DIRECT flag for cache writing. |
| spark.gluten.velox.castFromVarcharAddTrimNode | false | If true, adds a trim node with the same semantics as vanilla Spark to CAST-from-varchar. Otherwise, does nothing. |
| spark.gluten.velox.fs.s3a.connect.timeout | 200s | Timeout for AWS S3 connections. |
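
Some of the values above are derived from other settings rather than fixed. The Python sketch below restates two of those rules as the descriptions give them: an explicit maxPartialAggregationMemory greater than 0 overrides maxPartialAggregationMemoryRatio of the off-heap size, and the default minimum shuffle batch size is 0.25 * spark.gluten.sql.columnar.maxBatchSize. The function names and parameters are hypothetical and only illustrate the arithmetic; they are not Gluten APIs, and the actual resolution inside Gluten may differ in rounding or other details.

```python
from typing import Optional


def partial_aggregation_memory_limit(
    offheap_bytes: int,
    ratio: float = 0.1,
    explicit_bytes: Optional[int] = None,
) -> int:
    """Resolve the partial aggregation memory limit in bytes (illustrative only).

    ratio stands for ...velox.maxPartialAggregationMemoryRatio (default 0.1).
    explicit_bytes stands for ...velox.maxPartialAggregationMemory (unset by default).
    """
    # An explicit value greater than 0 overrides the ratio-based limit.
    if explicit_bytes is not None and explicit_bytes > 0:
        return explicit_bytes
    # Otherwise the limit is a fraction of the off-heap size.
    # Truncating to an integer is an assumption of this sketch.
    return int(offheap_bytes * ratio)


def default_min_shuffle_batch_size(max_batch_size: int) -> int:
    """Default minimum batch size: 0.25 * spark.gluten.sql.columnar.maxBatchSize."""
    return int(0.25 * max_batch_size)
```

Both rules only matter when the corresponding feature is on: the partial aggregation limits are ignored when flushablePartialAggregation is false, and the minimum batch size applies only when resizeBatches.shuffleInput or resizeBatches.shuffleOutput is true.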

Gluten Velox backend experimental configurations

| Key | Default | Description |
|---|---|---|
| spark.gluten.velox.offHeapBroadcastBuildRelation.enabled | false | Experimental: If enabled, the broadcast build relation will use off-heap memory. Otherwise, the broadcast build relation will use on-heap memory. |
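
Because every key on this page is an ordinary Spark configuration entry, one common way to apply them is through `spark-submit --conf key=value`. The following Python sketch renders a dictionary of settings into such arguments. The helper `to_spark_submit_args`, the chosen values, and the cache path are hypothetical and serve only as an illustration, not as recommended settings.

```python
from typing import Dict, List

VELOX_PREFIX = "spark.gluten.sql.columnar.backend.velox."


def to_spark_submit_args(conf: Dict[str, object]) -> List[str]:
    """Render settings as spark-submit arguments: ["--conf", "key=value", ...]."""
    args: List[str] = []
    for key, value in conf.items():
        # Booleans are written in lowercase to match the notation used in the tables.
        text = str(value).lower() if isinstance(value, bool) else str(value)
        args += ["--conf", f"{key}={text}"]
    return args


# Example values only; the SSD cache path is a hypothetical location.
settings = {
    VELOX_PREFIX + "cacheEnabled": True,
    VELOX_PREFIX + "ssdCachePath": "/mnt/ssd/velox-cache",
    VELOX_PREFIX + "ssdCacheSize": "1GB",
    "spark.gluten.velox.offHeapBroadcastBuildRelation.enabled": True,
}

print(" ".join(to_spark_submit_args(settings)))
```

For the settings above, the printed line begins with `--conf spark.gluten.sql.columnar.backend.velox.cacheEnabled=true`. Keeping the settings in a dictionary makes it easy to toggle an experimental option such as offHeapBroadcastBuildRelation.enabled in one place.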