---
layout: page
title: Velox Backend's Supported Operators & Functions
nav_order: 4
---

# The Operators and Functions Support Progress

Gluten is still under active development. Here is a list of supported operators and functions.

Since the same function may have different semantics in Presto and Spark, Velox implements functions in the Presto category by default; when a function's semantics differ from Spark's, a separate implementation is added in the Spark category. Gluten therefore first looks for a function in Velox's Spark category and, if it is not implemented there, falls back to the Presto category.
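
This lookup order can be pictured with a minimal sketch. It is a conceptual illustration only, not Gluten's actual resolution code; the registry sets and the `resolveVeloxFunction` helper are hypothetical names.

```scala
// Conceptual sketch of the lookup order described above (hypothetical helper,
// not Gluten's real API): prefer Velox's Spark-specific implementation, then
// fall back to the Presto one; otherwise the expression is not offloaded.
def resolveVeloxFunction(
    sparkName: String,
    veloxSparkRegistry: Set[String],
    veloxPrestoRegistry: Set[String]): Option[String] = {
  if (veloxSparkRegistry.contains(sparkName)) {
    Some(s"spark:$sparkName") // Spark-specific semantics implemented in Velox
  } else if (veloxPrestoRegistry.contains(sparkName)) {
    Some(s"presto:$sparkName") // Presto implementation whose semantics match Spark
  } else {
    None // not available in Velox: Gluten falls back to vanilla Spark for this expression
  }
}
```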

The tables below use the following notations to describe the support status of operators and functions:

| Value | Description |
| ----- | ----------- |
| S | Supported. Gluten or Velox supports it fully. |
| S* | Marks a foldable expression that will be converted to an alias after Spark's optimization. |
| (blank cell) | Not applicable, or needs to be confirmed. |
| PS | Partial Support. Velox only partially supports it. |
| NS | Not Supported. The Velox backend does not support it. |
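
To make the S* case concrete, here is a small spark-shell sketch (assuming a local SparkSession; the session settings are illustrative): a foldable expression is constant-folded by Spark's optimizer into a literal wrapped in an alias, so the backend never evaluates the function at runtime.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .master("local[*]")
  .appName("foldable-expression-demo")
  .getOrCreate()

// round(2.5, 0) is foldable: Spark's optimizer replaces it with a literal
// wrapped in an Alias before the plan ever reaches Gluten/Velox.
spark.sql("SELECT round(2.5, 0) AS r").explain(true)
// The optimized logical plan contains something like: Project [3.0 AS r#0]
```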

Additional notations describe restrictions on the function implementations:

| Value | Description |
| ----- | ----------- |
| Mismatched | Implemented by Velox, but with semantics that differ from Apache Spark; such functions are marked as "Mismatched". |
| ANSI OFF | Gluten doesn't support ANSI mode. If it is enabled, Gluten falls back to vanilla Spark. |
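
As a usage note on the ANSI OFF restriction, the sketch below keeps ANSI mode disabled when building a session. `spark.sql.ansi.enabled` is a standard Spark SQL setting (false by default); the plugin class name and the rest of the Gluten configuration shown here are illustrative, not a complete setup, and may differ across Gluten versions and deployments.

```scala
import org.apache.spark.sql.SparkSession

// Keep ANSI mode off so expressions stay eligible for Velox offload;
// if spark.sql.ansi.enabled is true, affected plans fall back to vanilla Spark.
// The Gluten plugin config below is illustrative and not a complete configuration.
val spark = SparkSession.builder()
  .appName("gluten-ansi-off")
  .config("spark.plugins", "org.apache.gluten.GlutenPlugin")
  .config("spark.sql.ansi.enabled", "false")
  .getOrCreate()
```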

# Operator Map

Gluten supports 30+ operators (Drag to right to see all data types)

| Executor | Description | Gluten Name | Velox Name | BOOLEAN | BYTE | SHORT | INT | LONG | FLOAT | DOUBLE | STRING | NULL | BINARY | ARRAY | MAP | STRUCT(ROW) | DATE | TIMESTAMP | DECIMAL | CALENDAR | UDT |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| FileSourceScanExec | Reading data from files, often from Hive tables | FileSourceScanExecTransformer | TableScanNode | S | S | S | S | S | S | S | S | S | S | NS | NS | NS | S | NS | NS | NS | NS |
| BatchScanExec | The backend for most file input | BatchScanExecTransformer | TableScanNode | S | S | S | S | S | S | S | S | S | S | NS | NS | NS | S | NS | NS | NS | NS |
| FilterExec | The backend for most filter statements | FilterExecTransformer | FilterNode | S | S | S | S | S | S | S | S | S | S | NS | NS | NS | S | NS | NS | NS | NS |
| ProjectExec | The backend for most select, withColumn and dropColumn statements | ProjectExecTransformer | ProjectNode | S | S | S | S | S | S | S | S | S | S | NS | NS | NS | S | NS | NS | NS | NS |
| HashAggregateExec | The backend for hash based aggregations | HashAggregateBaseTransformer | AggregationNode | S | S | S | S | S | S | S | S | S | S | NS | NS | NS | S | NS | NS | NS | NS |
| BroadcastHashJoinExec | Implementation of join using broadcast data | BroadcastHashJoinExecTransformer | HashJoinNode | S | S | S | S | S | S | S | S | S | S | NS | NS | NS | S | NS | NS | NS | NS |
| ShuffledHashJoinExec | Implementation of join using hashed shuffled data | ShuffleHashJoinExecTransformer | HashJoinNode | S | S | S | S | S | S | S | S | S | S | NS | NS | NS | S | NS | NS | NS | NS |
| SortExec | The backend for the sort operator | SortExecTransformer | OrderByNode | S | S | S | S | S | S | S | S | S | S | NS | NS | NS | S | NS | NS | NS | NS |
| SortMergeJoinExec | Sort merge join, replacing with shuffled hash join | SortMergeJoinExecTransformer | MergeJoinNode | S | S | S | S | S | S | S | S | S | S | NS | NS | NS | S | NS | NS | NS | NS |
| WindowExec | Window operator backend | WindowExecTransformer | WindowNode | S | S | S | S | S | S | S | S | S | S | NS | NS | NS | S | NS | NS | NS | NS |
| GlobalLimitExec | Limiting of results across partitions | LimitTransformer | LimitNode | S | S | S | S | S | S | S | S | S | S | NS | NS | NS | S | NS | NS | NS | NS |
| LocalLimitExec | Per-partition limiting of results | LimitTransformer | LimitNode | S | S | S | S | S | S | S | S | S | S | NS | NS | NS | S | NS | NS | NS | NS |
| ExpandExec | The backend for the expand operator | ExpandExecTransformer | GroupIdNode | S | S | S | S | S | S | S | S | S | S | NS | NS | NS | S | NS | NS | NS | NS |
| UnionExec | The backend for the union operator | UnionExecTransformer | N | S | S | S | S | S | S | S | S | S | S | NS | NS | NS | S | NS | NS | NS | NS |
| DataWritingCommandExec | Writing data | Y | TableWriteNode | S | S | S | S | S | S | S | S | S | S | S | NS | S | S | NS | S | NS | NS |
| CartesianProductExec | Implementation of join using brute force | CartesianProductExecTransformer | NestedLoopJoinNode | S | S | S | S | S | S | S | S | S | S | NS | NS | NS | S | NS | NS | NS | NS |
| ShuffleExchangeExec | The backend for most data being exchanged between processes | ColumnarShuffleExchangeExec | ExchangeNode | NS | NS | NS | NS | NS | NS | NS | NS | NS | NS | NS | NS | NS | NS | NS | NS | NS | NS |
| | The unnest operation expands arrays and maps into separate columns | N | UnnestNode | NS | NS | NS | NS | NS | NS | NS | NS | NS | NS | NS | NS | NS | NS | NS | NS | NS | NS |
| | The top-n operation reorders a dataset based on one or more identified sort fields as well as a sorting order | N | TopNNode | NS | NS | NS | NS | NS | NS | NS | NS | NS | NS | NS | NS | NS | NS | NS | NS | NS | NS |
| | The partitioned output operation redistributes data based on zero or more distribution fields | N | PartitionedOutputNode | NS | NS | NS | NS | NS | NS | NS | NS | NS | NS | NS | NS | NS | NS | NS | NS | NS | NS |
| | The values operation returns specified data | N | ValuesNode | NS | NS | NS | NS | NS | NS | NS | NS | NS | NS | NS | NS | NS | NS | NS | NS | NS | NS |
| | A receiving operation that merges multiple ordered streams to maintain orderedness | N | MergeExchangeNode | NS | NS | NS | NS | NS | NS | NS | NS | NS | NS | NS | NS | NS | NS | NS | NS | NS | NS |
| | An operation that merges multiple ordered streams to maintain orderedness | N | LocalMergeNode | NS | NS | NS | NS | NS | NS | NS | NS | NS | NS | NS | NS | NS | NS | NS | NS | NS | NS |
| | Partitions input data into multiple streams or combines data from multiple streams into a single stream | N | LocalPartitionNode | NS | NS | NS | NS | NS | NS | NS | NS | NS | NS | NS | NS | NS | NS | NS | NS | NS | NS |
| | The enforce single row operation checks that input contains at most one row and returns that row unmodified | N | EnforceSingleRowNode | NS | NS | NS | NS | NS | NS | NS | NS | NS | NS | NS | NS | NS | NS | NS | NS | NS | NS |
| | The assign unique id operation adds one column at the end of the input columns with unique value per row | N | AssignUniqueIdNode | NS | NS | NS | NS | NS | NS | NS | NS | NS | NS | NS | NS | NS | S | S | S | S | S |
| ReusedExchangeExec | A wrapper for reused exchange to have different output | ReusedExchangeExec | N |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |
| CollectLimitExec | Reduce to single partition and apply limit | ColumnarCollectLimitExec | N | S | S | S | S | S | S | S | S | S | S | S | S | S | S | S | S | S | S |
| CollectTailExec | Collect the tail x elements from dataframe | ColumnarCollectTailExec | N | S | S | S | S | S | S | S | S | S | S | S | S | S | S | S | S | S | S |
| BroadcastExchangeExec | The backend for broadcast exchange of data | Y | Y | S | S | S | S | S | S | S | S | S | S | NS | NS | NS | S | NS | S | NS | NS |
| ObjectHashAggregateExec | The backend for hash based aggregations supporting TypedImperativeAggregate functions | HashAggregateExecBaseTransformer | N |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |
| SortAggregateExec | The backend for sort based aggregations | HashAggregateExecBaseTransformer (Partially supported) | N |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |
| CoalesceExec | Reduce the partition numbers | CoalesceExecTransformer | N |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |
| GenerateExec | The backend for operations that generate more output rows than input rows like explode | GenerateExecTransformer | UnnestNode |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |
| RangeExec | The backend for the range operator | ColumnarRangeExec | N | S | S | S | S | S | S | S | S | S | S | S | S | S | S | S | S | S | S |
| SampleExec | The backend for the sample operator | SampleExecTransformer | N |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |
| SubqueryBroadcastExec | Plan to collect and transform the broadcast key values | Y | Y | S | S | S | S | S | S | S | S | S | S | NS | NS | NS | S | NS | S | NS | NS |
| TakeOrderedAndProjectExec | Take the first limit elements as defined by the sortOrder, and do projection if needed | Y | Y | S | S | S | S | S | S | S | S | S | S | NS | NS | NS | S | NS | S | NS | NS |
| CustomShuffleReaderExec | A wrapper of shuffle query stage | N | N |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |
| InMemoryTableScanExec | Implementation of InMemory Table Scan | Y | Y |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |
| BroadcastNestedLoopJoinExec | Implementation of join using brute force. Full outer joins and joins where the broadcast side matches the join side (e.g.: LeftOuter with left broadcast) are not supported | BroadcastNestedLoopJoinExecTransformer | NestedLoopJoinNode | S | S | S | S | S | S | S | S | S | S | NS | NS | NS | S | NS | NS | NS | NS |
| AggregateInPandasExec | The backend for an Aggregation Pandas UDF; this accelerates the data transfer between the Java process and the Python process | N | N |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |
| ArrowEvalPythonExec | The backend of the Scalar Pandas UDFs. Accelerates the data transfer between the Java process and the Python process | N | N |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |
| FlatMapGroupsInPandasExec | The backend for Flat Map Groups Pandas UDF. Accelerates the data transfer between the Java process and the Python process | N | N |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |
| MapInPandasExec | The backend for Map Pandas Iterator UDF. Accelerates the data transfer between the Java process and the Python process | N | N |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |
| WindowInPandasExec | The backend for Window Aggregation Pandas UDF. Accelerates the data transfer between the Java process and the Python process | N | N |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |
| HiveTableScanExec | The Hive table scan operator. Column and partition pruning are both handled | Y | Y |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |
| InsertIntoHiveTable | Command for writing data out to a Hive table | Y | Y |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |
| Velox2Row | Convert Velox format to Row format | Y | Y | S | S | S | S | S | S | S | S | NS | S | NS | NS | NS | S | S | NS | NS | NS |
| Velox2Arrow | Convert Velox format to Arrow format | Y | Y | S | S | S | S | S | S | S | S | NS | S | S | S | S | S | NS | S | NS | NS |
| WindowGroupLimitExec | Optimize window with rank-like function with a filter on it | Y | Y | S | S | S | S | S | S | S | S | NS | S | S | S | S | S | NS | S | NS | NS |
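
A quick way to see this mapping in practice is to inspect the physical plan of a query: operators that Gluten offloads appear under their Gluten names from the table above (for example `ProjectExecTransformer` and `FilterExecTransformer`), while anything that fell back keeps its vanilla Spark name. The snippet below is a minimal sketch that assumes a Gluten-enabled SparkSession is already configured.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().getOrCreate() // assumes Gluten is already configured

// A small query touching operators from the table above.
val df = spark.range(0, 1000)
  .selectExpr("id", "id % 10 AS bucket") // ProjectExec -> ProjectExecTransformer
  .filter("bucket > 5")                  // FilterExec  -> FilterExecTransformer

// Offloaded operators show up with their Gluten names in the printed physical plan;
// operators that fell back keep their vanilla Spark names.
df.explain()
```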

# Function Support Status

Spark categorizes built-in functions into four types: Scalar Functions, Aggregate Functions, Window Functions, and Generator Functions. In Gluten, the function support status is automatically generated by a script and maintained in separate files.

When running the script, the `--spark_home` argument should be set to one of the following:

- The directory containing the Spark source code for the latest Spark version supported by Gluten; the Spark project must be built from source.
- A directory prepared by the `install_spark_resources.sh` script, which downloads the necessary resource files:

  ```bash
  # Define a directory to use for the Spark files and the latest Spark version
  export spark_dir=/tmp/spark
  export spark_version=3.5

  # Run the install_spark_resources.sh script
  .github/workflows/util/install_spark_resources.sh ${spark_version} ${spark_dir}
  ```

  After running `install_spark_resources.sh`, the `--spark_home` argument for the document generation script will be something like `--spark_home=${spark_dir}/shims/spark35/spark_home`.

Use the following command to generate and update the support status:

```bash
python3 tools/scripts/gen-function-support-docs.py --spark_home=/path/to/spark_source_code
```

Please check the links below for the detailed support status of each category:

- Scalar Functions Support Status
- Aggregate Functions Support Status
- Window Functions Support Status
- Generator Functions Support Status