Gluten is still under active development. Here is a list of supported operators and functions.
Since the same function may have different semantics in Presto and Spark, Velox implements functions in the Presto category by default; when a function's Spark semantics differ, it is implemented separately in the Spark category. Gluten therefore first looks for a function in Velox's Spark category, and falls back to the Presto category if the function is not implemented there.
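In pseudocode, this lookup order looks like the sketch below (`sparkRegistry` and `prestoRegistry` are illustrative names, not Velox's actual API):

```scala
// Resolve a function: prefer the Spark-specific implementation when one
// exists, otherwise fall back to the Presto implementation.
def resolveFunction(name: String,
                    sparkRegistry: Map[String, String],
                    prestoRegistry: Map[String, String]): Option[String] =
  sparkRegistry.get(name).orElse(prestoRegistry.get(name))
```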
The following notations describe the support status of operators and functions in the tables below:
Value | Description |
---|---|
S | Supported. Gluten or Velox supports it fully. |
S* | Marks a foldable expression that will be converted to an alias after Spark's optimization. |
[Blank Cell] | Not applicable, or support status needs to be confirmed. |
PS | Partial Support. Velox only partially supports it. |
NS | Not Supported. Velox backend does not support it. |
Additional notations describe restrictions on function implementations:
Value | Description |
---|---|
Mismatched | Some functions are implemented by Velox but have different semantics from Apache Spark; these are marked as "Mismatched". |
ANSI OFF | Gluten doesn't support ANSI mode. If it is enabled, Gluten falls back to vanilla Spark. |
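Since enabling ANSI mode disables offload entirely, it is worth verifying the setting before expecting Gluten to take effect. Below is a minimal sketch using the standard Spark configuration API (the application name is arbitrary):

```scala
import org.apache.spark.sql.SparkSession

// spark.sql.ansi.enabled is a standard Spark config; Gluten offload
// requires it to be false, otherwise execution falls back to vanilla Spark.
val spark = SparkSession.builder()
  .appName("gluten-ansi-check")
  .config("spark.sql.ansi.enabled", "false")
  .getOrCreate()

println(spark.conf.get("spark.sql.ansi.enabled")) // expect "false"
```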
Gluten supports 30+ operators (scroll right to see all data types).
Executor | Description | Gluten Name | Velox Name | BOOLEAN | BYTE | SHORT | INT | LONG | FLOAT | DOUBLE | STRING | NULL | BINARY | ARRAY | MAP | STRUCT(ROW) | DATE | TIMESTAMP | DECIMAL | CALENDAR | UDT |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
FileSourceScanExec | Reading data from files, often from Hive tables | FileSourceScanExecTransformer | TableScanNode | S | S | S | S | S | S | S | S | S | S | NS | NS | NS | S | NS | NS | NS | NS |
BatchScanExec | The backend for most file input | BatchScanExecTransformer | TableScanNode | S | S | S | S | S | S | S | S | S | S | NS | NS | NS | S | NS | NS | NS | NS |
FilterExec | The backend for most filter statements | FilterExecTransformer | FilterNode | S | S | S | S | S | S | S | S | S | S | NS | NS | NS | S | NS | NS | NS | NS |
ProjectExec | The backend for most select, withColumn and dropColumn statements | ProjectExecTransformer | ProjectNode | S | S | S | S | S | S | S | S | S | S | NS | NS | NS | S | NS | NS | NS | NS |
HashAggregateExec | The backend for hash-based aggregations | HashAggregateExecBaseTransformer | AggregationNode | S | S | S | S | S | S | S | S | S | S | NS | NS | NS | S | NS | NS | NS | NS |
BroadcastHashJoinExec | Implementation of join using broadcast data | BroadcastHashJoinExecTransformer | HashJoinNode | S | S | S | S | S | S | S | S | S | S | NS | NS | NS | S | NS | NS | NS | NS |
ShuffledHashJoinExec | Implementation of join using hashed shuffled data | ShuffledHashJoinExecTransformer | HashJoinNode | S | S | S | S | S | S | S | S | S | S | NS | NS | NS | S | NS | NS | NS | NS |
SortExec | The backend for the sort operator | SortExecTransformer | OrderByNode | S | S | S | S | S | S | S | S | S | S | NS | NS | NS | S | NS | NS | NS | NS |
SortMergeJoinExec | Sort merge join; Gluten replaces it with shuffled hash join | SortMergeJoinExecTransformer | MergeJoinNode | S | S | S | S | S | S | S | S | S | S | NS | NS | NS | S | NS | NS | NS | NS |
WindowExec | Window operator backend | WindowExecTransformer | WindowNode | S | S | S | S | S | S | S | S | S | S | NS | NS | NS | S | NS | NS | NS | NS |
GlobalLimitExec | Limiting of results across partitions | LimitTransformer | LimitNode | S | S | S | S | S | S | S | S | S | S | NS | NS | NS | S | NS | NS | NS | NS |
LocalLimitExec | Per-partition limiting of results | LimitTransformer | LimitNode | S | S | S | S | S | S | S | S | S | S | NS | NS | NS | S | NS | NS | NS | NS |
ExpandExec | The backend for the expand operator | ExpandExecTransformer | GroupIdNode | S | S | S | S | S | S | S | S | S | S | NS | NS | NS | S | NS | NS | NS | NS |
UnionExec | The backend for the union operator | UnionExecTransformer | N | S | S | S | S | S | S | S | S | S | S | NS | NS | NS | S | NS | NS | NS | NS |
DataWritingCommandExec | Writing data | Y | TableWriteNode | S | S | S | S | S | S | S | S | S | S | S | NS | S | S | NS | S | NS | NS |
CartesianProductExec | Implementation of join using brute force | CartesianProductExecTransformer | NestedLoopJoinNode | S | S | S | S | S | S | S | S | S | S | NS | NS | NS | S | NS | NS | NS | NS |
ShuffleExchangeExec | The backend for most data being exchanged between processes | ColumnarShuffleExchangeExec | ExchangeNode | NS | NS | NS | NS | NS | NS | NS | NS | NS | NS | NS | NS | NS | NS | NS | NS | NS | NS |
| | The unnest operation expands arrays and maps into separate columns | N | UnnestNode | NS | NS | NS | NS | NS | NS | NS | NS | NS | NS | NS | NS | NS | NS | NS | NS | NS | NS |
| | The top-n operation reorders a dataset based on one or more identified sort fields as well as a sorting order | N | TopNNode | NS | NS | NS | NS | NS | NS | NS | NS | NS | NS | NS | NS | NS | NS | NS | NS | NS | NS |
| | The partitioned output operation redistributes data based on zero or more distribution fields | N | PartitionedOutputNode | NS | NS | NS | NS | NS | NS | NS | NS | NS | NS | NS | NS | NS | NS | NS | NS | NS | NS |
| | The values operation returns specified data | N | ValuesNode | NS | NS | NS | NS | NS | NS | NS | NS | NS | NS | NS | NS | NS | NS | NS | NS | NS | NS |
| | A receiving operation that merges multiple ordered streams to maintain orderedness | N | MergeExchangeNode | NS | NS | NS | NS | NS | NS | NS | NS | NS | NS | NS | NS | NS | NS | NS | NS | NS | NS |
| | An operation that merges multiple ordered streams to maintain orderedness | N | LocalMergeNode | NS | NS | NS | NS | NS | NS | NS | NS | NS | NS | NS | NS | NS | NS | NS | NS | NS | NS |
| | Partitions input data into multiple streams or combines data from multiple streams into a single stream | N | LocalPartitionNode | NS | NS | NS | NS | NS | NS | NS | NS | NS | NS | NS | NS | NS | NS | NS | NS | NS | NS |
| | The enforce single row operation checks that input contains at most one row and returns that row unmodified | N | EnforceSingleRowNode | NS | NS | NS | NS | NS | NS | NS | NS | NS | NS | NS | NS | NS | NS | NS | NS | NS | NS |
| | The assign unique id operation adds one column at the end of the input columns with unique value per row | N | AssignUniqueIdNode | NS | NS | NS | NS | NS | NS | NS | NS | NS | NS | NS | NS | NS | S | S | S | S | S |
ReusedExchangeExec | A wrapper for reused exchange to have different output | ReusedExchangeExec | N | ||||||||||||||||||
CollectLimitExec | Reduces to a single partition and applies the limit | ColumnarCollectLimitExec | N | S | S | S | S | S | S | S | S | S | S | S | S | S | S | S | S | S | S |
CollectTailExec | Collects the last x elements of the DataFrame | ColumnarCollectTailExec | N | S | S | S | S | S | S | S | S | S | S | S | S | S | S | S | S | S | S |
BroadcastExchangeExec | The backend for broadcast exchange of data | Y | Y | S | S | S | S | S | S | S | S | S | S | NS | NS | NS | S | NS | S | NS | NS |
ObjectHashAggregateExec | The backend for hash-based aggregations supporting TypedImperativeAggregate functions | HashAggregateExecBaseTransformer | N | ||||||||||||||||||
SortAggregateExec | The backend for sort-based aggregations | HashAggregateExecBaseTransformer (partially supported) | N | ||||||||||||||||||
CoalesceExec | Reduces the number of partitions | CoalesceExecTransformer | N | ||||||||||||||||||
GenerateExec | The backend for operations that generate more output rows than input rows, such as explode | GenerateExecTransformer | UnnestNode | ||||||||||||||||||
RangeExec | The backend for the range operator | ColumnarRangeExec | N | S | S | S | S | S | S | S | S | S | S | S | S | S | S | S | S | S | S |
SampleExec | The backend for the sample operator | SampleExecTransformer | N | ||||||||||||||||||
SubqueryBroadcastExec | Plan to collect and transform the broadcast key values | Y | Y | S | S | S | S | S | S | S | S | S | S | NS | NS | NS | S | NS | S | NS | NS |
TakeOrderedAndProjectExec | Take the first limit elements as defined by the sortOrder, and do projection if needed | Y | Y | S | S | S | S | S | S | S | S | S | S | NS | NS | NS | S | NS | S | NS | NS |
CustomShuffleReaderExec | A wrapper of shuffle query stage | N | N | ||||||||||||||||||
InMemoryTableScanExec | Implementation of InMemory Table Scan | Y | Y | ||||||||||||||||||
BroadcastNestedLoopJoinExec | Implementation of join using brute force. Full outer joins and joins where the broadcast side matches the join side (e.g.: LeftOuter with left broadcast) are not supported | BroadcastNestedLoopJoinExecTransformer | NestedLoopJoinNode | S | S | S | S | S | S | S | S | S | S | NS | NS | NS | S | NS | NS | NS | NS |
AggregateInPandasExec | The backend for Aggregation Pandas UDFs; accelerates the data transfer between the Java process and the Python process | N | N | ||||||||||||||||||
ArrowEvalPythonExec | The backend for Scalar Pandas UDFs; accelerates the data transfer between the Java process and the Python process | N | N | ||||||||||||||||||
FlatMapGroupsInPandasExec | The backend for Flat Map Groups Pandas UDFs; accelerates the data transfer between the Java process and the Python process | N | N | ||||||||||||||||||
MapInPandasExec | The backend for Map Pandas Iterator UDFs; accelerates the data transfer between the Java process and the Python process | N | N | ||||||||||||||||||
WindowInPandasExec | The backend for Window Aggregation Pandas UDFs; accelerates the data transfer between the Java process and the Python process | N | N | ||||||||||||||||||
HiveTableScanExec | The Hive table scan operator. Column and partition pruning are both handled | Y | Y | ||||||||||||||||||
InsertIntoHiveTable | Command for writing data out to a Hive table | Y | Y | ||||||||||||||||||
Velox2Row | Convert Velox format to Row format | Y | Y | S | S | S | S | S | S | S | S | NS | S | NS | NS | NS | S | S | NS | NS | NS |
Velox2Arrow | Convert Velox format to Arrow format | Y | Y | S | S | S | S | S | S | S | S | NS | S | S | S | S | S | NS | S | NS | NS |
WindowGroupLimitExec | Optimizes a window containing a rank-like function with a filter on it | Y | Y | S | S | S | S | S | S | S | S | NS | S | S | S | S | S | NS | S | NS | NS |
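To check whether an operator is actually offloaded, inspect the physical plan: offloaded operators appear under their Gluten names (the `*Transformer` variants above) rather than the vanilla Spark names. A minimal sketch, assuming a Gluten-enabled `SparkSession` named `spark`:

```scala
// With Gluten enabled, offloaded operators show up in the physical plan
// under their Gluten names, e.g. FilterExecTransformer instead of FilterExec.
val df = spark.range(100)
  .filter("id % 3 = 0")          // should map to FilterExecTransformer
  .selectExpr("id * 2 AS twice") // should map to ProjectExecTransformer

df.explain() // look for *Transformer nodes in the printed plan
```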
Spark categorizes built-in functions into four types: Scalar Functions, Aggregate Functions, Window Functions, and Generator Functions. In Gluten, the function support status is generated automatically by a script and maintained in separate files.
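For reference, one built-in function from each category, expressed in standard Spark SQL (assuming a `SparkSession` named `spark`):

```scala
spark.sql("SELECT abs(-1)").show()                                      // scalar
spark.sql("SELECT sum(id) FROM range(10)").show()                       // aggregate
spark.sql("SELECT id, rank() OVER (ORDER BY id) FROM range(3)").show() // window
spark.sql("SELECT explode(array(1, 2, 3))").show()                     // generator
```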
When running the script, the `--spark_home` arg should be set to either a local Spark source code directory or a directory prepared with the `install_spark_resources.sh` script, which downloads the necessary resource files:

```bash
# Define a directory to use for the Spark files and the latest Spark version
export spark_dir=/tmp/spark
export spark_version=3.5

# Run the install_spark_resources.sh script
.github/workflows/util/install_spark_resources.sh ${spark_version} ${spark_dir}
```

After running `install_spark_resources.sh`, the `--spark_home` for the document generation script will be something like: `--spark_home=${spark_dir}/shims/spark35/spark_home`
Use the following command to generate and update the support status:

```bash
python3 tools/scripts/gen-function-support-docs.py --spark_home=/path/to/spark_source_code
```
Please check the links below for the detailed support status of each category:

- Scalar Functions Support Status
- Aggregate Functions Support Status