---
layout: page
title: Velox Backend's Supported Operators & Functions
nav_order: 4
---
# Operator and Function Support Progress
Gluten is still under active development. Here is a list of supported operators and functions.
Because the same function may have different semantics in Presto and Spark, Velox implements functions in a Presto category by default; when a
function's Spark semantics differ, it is implemented in a separate Spark category. Gluten therefore first looks up a function in Velox's Spark
category and, if it is not implemented there, falls back to the Presto category.
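As a rough illustration of that lookup order, here is a minimal sketch; the registry names and contents are hypothetical and do not reflect Gluten's actual internals:
```scala
// Hypothetical registries standing in for Velox's Spark and Presto function
// categories; names are illustrative, not Gluten's actual internals.
val sparkCategory  = Map("unix_timestamp" -> "Spark-semantics implementation")
val prestoCategory = Map("abs" -> "Presto implementation",
                         "unix_timestamp" -> "Presto implementation")

// The lookup order described above: try the Spark category first,
// then fall back to the Presto category.
def resolve(name: String): Option[String] =
  sparkCategory.get(name).orElse(prestoCategory.get(name))

println(resolve("unix_timestamp")) // Some(Spark-semantics implementation)
println(resolve("abs"))            // Some(Presto implementation)
println(resolve("no_such_fn"))     // None
```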
The tables below use the following notations to describe the support status of operators and functions:
| Value | Description |
|--------------|------------------------------------------------------------------------------------------|
| S            | Supported. Gluten or Velox fully supports it.                                              |
| S*           | Marks a foldable expression (e.g. `pi()`) that Spark's optimizer evaluates and converts to an alias. |
| [Blank Cell] | Not applicable, or support still needs to be confirmed.                                    |
| PS | Partial Support. Velox only partially supports it. |
| NS | Not Supported. Velox backend does not support it. |
Additional notations describe restrictions on the function implementations:
| Value | Description |
|------------|------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| Mismatched | Implemented by Velox, but with semantics that differ from Apache Spark's; such functions are marked "Mismatched".                                               |
| ANSI OFF   | Gluten doesn't support [ANSI mode](https://spark.apache.org/docs/latest/sql-ref-ansi-compliance.html). If it is enabled, Gluten falls back to vanilla Spark (see the snippet below). |
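For example, in a `spark-shell` session (where `spark` is the predefined SparkSession), you can make sure ANSI mode is off before expecting Gluten to offload a query; `spark.sql.ansi.enabled` is the standard Spark config from the page linked above:
```scala
// ANSI mode must be disabled for Gluten to offload; with it enabled,
// Gluten falls back to vanilla Spark.
spark.conf.set("spark.sql.ansi.enabled", "false")
assert(spark.conf.get("spark.sql.ansi.enabled") == "false")
```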
## Operator Map
Gluten supports 30+ operators (drag right to see all data types). See the example after the table for how to verify whether an operator was offloaded.
| Executor | Description | Gluten Name | Velox Name | BOOLEAN | BYTE | SHORT | INT | LONG | FLOAT | DOUBLE | STRING | NULL | BINARY | ARRAY | MAP | STRUCT(ROW) | DATE | TIMESTAMP | DECIMAL | CALENDAR | UDT |
|-----------------------------|-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------|--------------------------------------------------------|-----------------------|---------|------|-------|-----|------|-------|--------|--------|------|--------|-------|-----|-------------|------|-----------|---------|----------|-----|
| FileSourceScanExec | Reading data from files, often from Hive tables | FileSourceScanExecTransformer | TableScanNode | S | S | S | S | S | S | S | S | S | S | NS | NS | NS | S | NS | NS | NS | NS |
| BatchScanExec | The backend for most file input | BatchScanExecTransformer | TableScanNode | S | S | S | S | S | S | S | S | S | S | NS | NS | NS | S | NS | NS | NS | NS |
| FilterExec | The backend for most filter statements | FilterExecTransformer | FilterNode | S | S | S | S | S | S | S | S | S | S | NS | NS | NS | S | NS | NS | NS | NS |
| ProjectExec | The backend for most select, withColumn and dropColumn statements | ProjectExecTransformer | ProjectNode | S | S | S | S | S | S | S | S | S | S | NS | NS | NS | S | NS | NS | NS | NS |
| HashAggregateExec | The backend for hash based aggregations | HashAggregateBaseTransformer | AggregationNode | S | S | S | S | S | S | S | S | S | S | NS | NS | NS | S | NS | NS | NS | NS |
| BroadcastHashJoinExec | Implementation of join using broadcast data | BroadcastHashJoinExecTransformer | HashJoinNode | S | S | S | S | S | S | S | S | S | S | NS | NS | NS | S | NS | NS | NS | NS |
| ShuffledHashJoinExec | Implementation of join using hashed shuffled data | ShuffleHashJoinExecTransformer | HashJoinNode | S | S | S | S | S | S | S | S | S | S | NS | NS | NS | S | NS | NS | NS | NS |
| SortExec | The backend for the sort operator | SortExecTransformer | OrderByNode | S | S | S | S | S | S | S | S | S | S | NS | NS | NS | S | NS | NS | NS | NS |
| SortMergeJoinExec           | Sort merge join, replaced by shuffled hash join                                                                                                                             | SortMergeJoinExecTransformer                           | MergeJoinNode         | S       | S    | S     | S   | S    | S     | S      | S      | S    | S      | NS    | NS  | NS          | S    | NS        | NS      | NS       | NS  |
| WindowExec | Window operator backend | WindowExecTransformer | WindowNode | S | S | S | S | S | S | S | S | S | S | NS | NS | NS | S | NS | NS | NS | NS |
| GlobalLimitExec | Limiting of results across partitions | LimitTransformer | LimitNode | S | S | S | S | S | S | S | S | S | S | NS | NS | NS | S | NS | NS | NS | NS |
| LocalLimitExec | Per-partition limiting of results | LimitTransformer | LimitNode | S | S | S | S | S | S | S | S | S | S | NS | NS | NS | S | NS | NS | NS | NS |
| ExpandExec | The backend for the expand operator | ExpandExecTransformer | GroupIdNode | S | S | S | S | S | S | S | S | S | S | NS | NS | NS | S | NS | NS | NS | NS |
| UnionExec | The backend for the union operator | UnionExecTransformer | N | S | S | S | S | S | S | S | S | S | S | NS | NS | NS | S | NS | NS | NS | NS |
| DataWritingCommandExec | Writing data | Y | TableWriteNode | S | S | S | S | S | S | S | S | S | S | S | NS | S | S | NS | S | NS | NS |
| CartesianProductExec | Implementation of join using brute force | CartesianProductExecTransformer | NestedLoopJoinNode | S | S | S | S | S | S | S | S | S | S | NS | NS | NS | S | NS | NS | NS | NS |
| ShuffleExchangeExec | The backend for most data being exchanged between processes | ColumnarShuffleExchangeExec | ExchangeNode | NS | NS | NS | NS | NS | NS | NS | NS | NS | NS | NS | NS | NS | NS | NS | NS | NS | NS |
| | The unnest operation expands arrays and maps into separate columns | N | UnnestNode | NS | NS | NS | NS | NS | NS | NS | NS | NS | NS | NS | NS | NS | NS | NS | NS | NS | NS |
| | The top-n operation reorders a dataset based on one or more identified sort fields as well as a sorting order | N | TopNNode | NS | NS | NS | NS | NS | NS | NS | NS | NS | NS | NS | NS | NS | NS | NS | NS | NS | NS |
| | The partitioned output operation redistributes data based on zero or more distribution fields | N | PartitionedOutputNode | NS | NS | NS | NS | NS | NS | NS | NS | NS | NS | NS | NS | NS | NS | NS | NS | NS | NS |
| | The values operation returns specified data | N | ValuesNode | NS | NS | NS | NS | NS | NS | NS | NS | NS | NS | NS | NS | NS | NS | NS | NS | NS | NS |
| | A receiving operation that merges multiple ordered streams to maintain orderedness | N | MergeExchangeNode | NS | NS | NS | NS | NS | NS | NS | NS | NS | NS | NS | NS | NS | NS | NS | NS | NS | NS |
| | An operation that merges multiple ordered streams to maintain orderedness | N | LocalMergeNode | NS | NS | NS | NS | NS | NS | NS | NS | NS | NS | NS | NS | NS | NS | NS | NS | NS | NS |
| | Partitions input data into multiple streams or combines data from multiple streams into a single stream | N | LocalPartitionNode | NS | NS | NS | NS | NS | NS | NS | NS | NS | NS | NS | NS | NS | NS | NS | NS | NS | NS |
| | The enforce single row operation checks that input contains at most one row and returns that row unmodified | N | EnforceSingleRowNode | NS | NS | NS | NS | NS | NS | NS | NS | NS | NS | NS | NS | NS | NS | NS | NS | NS | NS |
| | The assign unique id operation adds one column at the end of the input columns with unique value per row | N | AssignUniqueIdNode | NS | NS | NS | NS | NS | NS | NS | NS | NS | NS | NS | NS | NS | S | S | S | S | S |
| ReusedExchangeExec | A wrapper for reused exchange to have different output | ReusedExchangeExec | N | | | | | | | | | | | | | | | | | | |
| CollectLimitExec | Reduce to single partition and apply limit | ColumnarCollectLimitExec | N | S | S | S | S | S | S | S | S | S | S | S | S | S | S | S | S | S | S |
| CollectTailExec | Collect the tail `x` elements from dataframe | ColumnarCollectTailExec | N | S | S | S | S | S | S | S | S | S | S | S | S | S | S | S | S | S | S |
| BroadcastExchangeExec | The backend for broadcast exchange of data | Y | Y | S | S | S | S | S | S | S | S | S | S | NS | NS | NS | S | NS | S | NS | NS |
| ObjectHashAggregateExec | The backend for hash based aggregations supporting TypedImperativeAggregate functions | HashAggregateExecBaseTransformer | N | | | | | | | | | | | | | | | | | | |
| SortAggregateExec | The backend for sort based aggregations | HashAggregateExecBaseTransformer (Partially supported) | N | | | | | | | | | | | | | | | | | | |
| CoalesceExec | Reduce the partition numbers | CoalesceExecTransformer | N | | | | | | | | | | | | | | | | | | |
| GenerateExec | The backend for operations that generate more output rows than input rows like explode | GenerateExecTransformer | UnnestNode | | | | | | | | | | | | | | | | | | |
| RangeExec | The backend for range operator | ColumnarRangeExec | N | S | S | S | S | S | S | S | S | S | S | S | S | S | S | S | S | S | S |
| SampleExec | The backend for the sample operator | SampleExecTransformer | N | | | | | | | | | | | | | | | | | | |
| SubqueryBroadcastExec | Plan to collect and transform the broadcast key values | Y | Y | S | S | S | S | S | S | S | S | S | S | NS | NS | NS | S | NS | S | NS | NS |
| TakeOrderedAndProjectExec | Take the first limit elements as defined by the sortOrder, and do projection if needed | Y | Y | S | S | S | S | S | S | S | S | S | S | NS | NS | NS | S | NS | S | NS | NS |
| CustomShuffleReaderExec | A wrapper of shuffle query stage | N | N | | | | | | | | | | | | | | | | | | |
| InMemoryTableScanExec | Implementation of InMemory Table Scan | Y | Y | | | | | | | | | | | | | | | | | | |
| BroadcastNestedLoopJoinExec | Implementation of join using brute force. Full outer joins and joins where the broadcast side matches the join side (e.g.: LeftOuter with left broadcast) are not supported | BroadcastNestedLoopJoinExecTransformer | NestedLoopJoinNode | S | S | S | S | S | S | S | S | S | S | NS | NS | NS | S | NS | NS | NS | NS |
| AggregateInPandasExec | The backend for an Aggregation Pandas UDF, this accelerates the data transfer between the Java process and the Python process | N | N | | | | | | | | | | | | | | | | | | |
| ArrowEvalPythonExec | The backend of the Scalar Pandas UDFs. Accelerates the data transfer between the Java process and the Python process | N | N | | | | | | | | | | | | | | | | | | |
| FlatMapGroupsInPandasExec | The backend for Flat Map Groups Pandas UDF, Accelerates the data transfer between the Java process and the Python process | N | N | | | | | | | | | | | | | | | | | | |
| MapInPandasExec | The backend for Map Pandas Iterator UDF. Accelerates the data transfer between the Java process and the Python process | N | N | | | | | | | | | | | | | | | | | | |
| WindowInPandasExec | The backend for Window Aggregation Pandas UDF, Accelerates the data transfer between the Java process and the Python process | N | N | | | | | | | | | | | | | | | | | | |
| HiveTableScanExec | The Hive table scan operator. Column and partition pruning are both handled | Y | Y | | | | | | | | | | | | | | | | | | |
| InsertIntoHiveTable | Command for writing data out to a Hive table | Y | Y | | | | | | | | | | | | | | | | | | |
| Velox2Row | Convert Velox format to Row format | Y | Y | S | S | S | S | S | S | S | S | NS | S | NS | NS | NS | S | S | NS | NS | NS |
| Velox2Arrow | Convert Velox format to Arrow format | Y | Y | S | S | S | S | S | S | S | S | NS | S | S | S | S | S | NS | S | NS | NS |
| WindowGroupLimitExec        | Optimizes a window with a rank-like function that has a filter on it                                                                                                        | Y                                                      | Y                     | S       | S    | S     | S   | S    | S     | S      | S      | NS   | S      | S     | S   | S           | S    | NS        | S       | NS       | NS  |
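As referenced above the table, you can verify which operators were offloaded by inspecting the physical plan: with Gluten enabled, offloaded operators appear under their Gluten names from the table (e.g. `FilterExecTransformer`), while operators that fell back keep their vanilla Spark names. A minimal `spark-shell` sketch, assuming Gluten is already configured on the session:
```scala
// Assumes a spark-shell session with Gluten already enabled.
val df = spark.range(100).selectExpr("id + 1 AS x").filter("x > 10")
// Offloaded operators show up with their Gluten names (e.g.
// ProjectExecTransformer, FilterExecTransformer); vanilla Spark
// names in the plan indicate a fallback.
df.explain()
```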
## Function Support Status
Spark categorizes built-in functions into four types: Scalar Functions, Aggregate Functions,
Window Functions, and Generator Functions.
In Gluten, the function support status is generated automatically by a script and maintained in separate files.
When running the script, the `--spark_home` arg should point to one of the following:
* The directory containing the Spark source code for the latest Spark version supported by Gluten; the Spark
project must be built from source.
* A directory produced by the `install_spark_resources.sh` script, which downloads the necessary resource files:
```shell
# Define a directory to use for the Spark files and the latest Spark version
export spark_dir=/tmp/spark
export spark_version=3.5
# Run the install_spark_resources.sh script
.github/workflows/util/install_spark_resources.sh ${spark_version} ${spark_dir}
```
After running `install_spark_resources.sh`, the `--spark_home` argument for the document generation script will be
something like `--spark_home=${spark_dir}/shims/spark35/spark_home`.
Use the following command to generate and update the support status:
```shell
python3 tools/scripts/gen-function-support-docs.py --spark_home=/path/to/spark_source_code
```
Please check the links below for the detailed support status of each category:
* [Scalar Functions Support Status](./velox-backend-scalar-function-support.md)
* [Aggregate Functions Support Status](./velox-backend-aggregate-function-support.md)
* [Window Functions Support Status](./velox-backend-window-function-support.md)
* [Generator Functions Support Status](./velox-backend-generator-function-support.md)