---
layout: page
title: Velox Backend's Supported Operators & Functions
nav_order: 4
---

# Operator and Function Support Progress

Gluten is still under active development. Here is a list of the currently supported operators and functions.

Since the same function may have different semantics in Presto and Spark, Velox implements functions under the Presto
category by default; when we note that a function's semantics differ in Spark, the function is implemented under the
Spark category instead. Gluten therefore first looks a function up in Velox's Spark category, and falls back to the
Presto category if it is not found there.
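
As a rough illustration of that lookup order, here is a minimal Scala sketch; `resolveVeloxFunction`, `sparkCategory`,
and `prestoCategory` are hypothetical names for illustration, not Gluten's actual internals:

```scala
// Hypothetical sketch of the lookup order described above. The two sets stand
// in for Velox's Spark and Presto function registries.
def resolveVeloxFunction(
    name: String,
    sparkCategory: Set[String],
    prestoCategory: Set[String]): Option[String] = {
  if (sparkCategory.contains(name)) {
    Some(s"spark.$name") // Spark-specific semantics take priority
  } else if (prestoCategory.contains(name)) {
    Some(s"presto.$name") // Presto implementation matches Spark's semantics
  } else {
    None // not implemented in Velox; Gluten falls back to vanilla Spark
  }
}
```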

The tables below use the following notations to describe the support status of operators and functions:

| Value        | Description                                                                              |
|--------------|------------------------------------------------------------------------------------------|
| S            | Supported. Gluten or Velox supports it fully.                                            |
| S*           | Marks a foldable expression that is converted to an alias during Spark's optimization.   |
| [Blank Cell] | Not applicable, or the support status still needs to be confirmed.                       |
| PS           | Partially Supported. Velox supports it only partially.                                   |
| NS           | Not Supported. The Velox backend does not support it.                                    |

Additional notations describe restrictions on the function implementations:

| Value      | Description                                                                                                                                                  |
|------------|--------------------------------------------------------------------------------------------------------------------------------------------------------------|
| Mismatched | Implemented by Velox, but with semantics that differ from Apache Spark's.                                                                                    |
| ANSI OFF   | Gluten doesn't support [ANSI mode](https://spark.apache.org/docs/latest/sql-ref-ansi-compliance.html). If it is enabled, Gluten falls back to vanilla Spark. |
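
As a concrete illustration of the ANSI OFF restriction above, the sketch below builds a session with ANSI mode left
disabled (`spark.sql.ansi.enabled` is the standard Spark config key; the app name is hypothetical):

```scala
import org.apache.spark.sql.SparkSession

// Keep ANSI mode off so queries remain eligible for the Velox backend; with
// spark.sql.ansi.enabled=true, Gluten falls back to vanilla Spark.
val spark = SparkSession.builder()
  .appName("gluten-ansi-off-example")
  .config("spark.sql.ansi.enabled", "false")
  .getOrCreate()
```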

## Operator Map

Gluten supports 30+ operators (scroll right to see all data types).

| Executor | Description | Gluten Name | Velox Name | BOOLEAN | BYTE | SHORT | INT | LONG | FLOAT | DOUBLE | STRING | NULL | BINARY | ARRAY | MAP | STRUCT(ROW) | DATE | TIMESTAMP | DECIMAL | CALENDAR | UDT |
|-----------------------------|-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------|--------------------------------------------------------|-----------------------|---------|------|-------|-----|------|-------|--------|--------|------|--------|-------|-----|-------------|------|-----------|---------|----------|-----|
| FileSourceScanExec | Reading data from files, often from Hive tables | FileSourceScanExecTransformer | TableScanNode | S | S | S | S | S | S | S | S | S | S | NS | NS | NS | S | NS | NS | NS | NS |
| BatchScanExec | The backend for most file input | BatchScanExecTransformer | TableScanNode | S | S | S | S | S | S | S | S | S | S | NS | NS | NS | S | NS | NS | NS | NS |
| FilterExec | The backend for most filter statements | FilterExecTransformer | FilterNode | S | S | S | S | S | S | S | S | S | S | NS | NS | NS | S | NS | NS | NS | NS |
| ProjectExec | The backend for most select, withColumn and dropColumn statements | ProjectExecTransformer | ProjectNode | S | S | S | S | S | S | S | S | S | S | NS | NS | NS | S | NS | NS | NS | NS |
| HashAggregateExec | The backend for hash based aggregations | HashAggregateBaseTransformer | AggregationNode | S | S | S | S | S | S | S | S | S | S | NS | NS | NS | S | NS | NS | NS | NS |
| BroadcastHashJoinExec | Implementation of join using broadcast data | BroadcastHashJoinExecTransformer | HashJoinNode | S | S | S | S | S | S | S | S | S | S | NS | NS | NS | S | NS | NS | NS | NS |
| ShuffledHashJoinExec | Implementation of join using hashed shuffled data | ShuffleHashJoinExecTransformer | HashJoinNode | S | S | S | S | S | S | S | S | S | S | NS | NS | NS | S | NS | NS | NS | NS |
| SortExec | The backend for the sort operator | SortExecTransformer | OrderByNode | S | S | S | S | S | S | S | S | S | S | NS | NS | NS | S | NS | NS | NS | NS |
| SortMergeJoinExec | Sort merge join, replacing with shuffled hash join | SortMergeJoinExecTransformer | MergeJoinNode | S | S | S | S | S | S | S | S | S | S | NS | NS | NS | S | NS | NS | NS | NS |
| WindowExec | Window operator backend | WindowExecTransformer | WindowNode | S | S | S | S | S | S | S | S | S | S | NS | NS | NS | S | NS | NS | NS | NS |
| GlobalLimitExec | Limiting of results across partitions | LimitTransformer | LimitNode | S | S | S | S | S | S | S | S | S | S | NS | NS | NS | S | NS | NS | NS | NS |
| LocalLimitExec | Per-partition limiting of results | LimitTransformer | LimitNode | S | S | S | S | S | S | S | S | S | S | NS | NS | NS | S | NS | NS | NS | NS |
| ExpandExec | The backend for the expand operator | ExpandExecTransformer | GroupIdNode | S | S | S | S | S | S | S | S | S | S | NS | NS | NS | S | NS | NS | NS | NS |
| UnionExec | The backend for the union operator | UnionExecTransformer | N | S | S | S | S | S | S | S | S | S | S | NS | NS | NS | S | NS | NS | NS | NS |
| DataWritingCommandExec | Writing data | Y | TableWriteNode | S | S | S | S | S | S | S | S | S | S | S | NS | S | S | NS | S | NS | NS |
| CartesianProductExec | Implementation of join using brute force | CartesianProductExecTransformer | NestedLoopJoinNode | S | S | S | S | S | S | S | S | S | S | NS | NS | NS | S | NS | NS | NS | NS |
| ShuffleExchangeExec | The backend for most data being exchanged between processes | ColumnarShuffleExchangeExec | ExchangeNode | NS | NS | NS | NS | NS | NS | NS | NS | NS | NS | NS | NS | NS | NS | NS | NS | NS | NS |
| | The unnest operation expands arrays and maps into separate columns | N | UnnestNode | NS | NS | NS | NS | NS | NS | NS | NS | NS | NS | NS | NS | NS | NS | NS | NS | NS | NS |
| | The top-n operation reorders a dataset based on one or more identified sort fields as well as a sorting order | N | TopNNode | NS | NS | NS | NS | NS | NS | NS | NS | NS | NS | NS | NS | NS | NS | NS | NS | NS | NS |
| | The partitioned output operation redistributes data based on zero or more distribution fields | N | PartitionedOutputNode | NS | NS | NS | NS | NS | NS | NS | NS | NS | NS | NS | NS | NS | NS | NS | NS | NS | NS |
| | The values operation returns specified data | N | ValuesNode | NS | NS | NS | NS | NS | NS | NS | NS | NS | NS | NS | NS | NS | NS | NS | NS | NS | NS |
| | A receiving operation that merges multiple ordered streams to maintain orderedness | N | MergeExchangeNode | NS | NS | NS | NS | NS | NS | NS | NS | NS | NS | NS | NS | NS | NS | NS | NS | NS | NS |
| | An operation that merges multiple ordered streams to maintain orderedness | N | LocalMergeNode | NS | NS | NS | NS | NS | NS | NS | NS | NS | NS | NS | NS | NS | NS | NS | NS | NS | NS |
| | Partitions input data into multiple streams or combines data from multiple streams into a single stream | N | LocalPartitionNode | NS | NS | NS | NS | NS | NS | NS | NS | NS | NS | NS | NS | NS | NS | NS | NS | NS | NS |
| | The enforce single row operation checks that input contains at most one row and returns that row unmodified | N | EnforceSingleRowNode | NS | NS | NS | NS | NS | NS | NS | NS | NS | NS | NS | NS | NS | NS | NS | NS | NS | NS |
| | The assign unique id operation adds one column at the end of the input columns with unique value per row | N | AssignUniqueIdNode | NS | NS | NS | NS | NS | NS | NS | NS | NS | NS | NS | NS | NS | S | S | S | S | S |
| ReusedExchangeExec | A wrapper for reused exchange to have different output | ReusedExchangeExec | N | | | | | | | | | | | | | | | | | | |
| CollectLimitExec | Reduce to single partition and apply limit | ColumnarCollectLimitExec | N | S | S | S | S | S | S | S | S | S | S | S | S | S | S | S | S | S | S |
| CollectTailExec | Collect the tail `x` elements from the dataframe | ColumnarCollectTailExec | N | S | S | S | S | S | S | S | S | S | S | S | S | S | S | S | S | S | S |
| BroadcastExchangeExec | The backend for broadcast exchange of data | Y | Y | S | S | S | S | S | S | S | S | S | S | NS | NS | NS | S | NS | S | NS | NS |
| ObjectHashAggregateExec | The backend for hash based aggregations supporting TypedImperativeAggregate functions | HashAggregateExecBaseTransformer | N | | | | | | | | | | | | | | | | | | |
| SortAggregateExec | The backend for sort based aggregations | HashAggregateExecBaseTransformer (Partially supported) | N | | | | | | | | | | | | | | | | | | |
| CoalesceExec | Reduces the number of partitions | CoalesceExecTransformer | N | | | | | | | | | | | | | | | | | | |
| GenerateExec | The backend for operations that generate more output rows than input rows, like explode | GenerateExecTransformer | UnnestNode | | | | | | | | | | | | | | | | | | |
| RangeExec | The backend for the range operator | ColumnarRangeExec | N | S | S | S | S | S | S | S | S | S | S | S | S | S | S | S | S | S | S |
| SampleExec | The backend for the sample operator | SampleExecTransformer | N | | | | | | | | | | | | | | | | | | |
| SubqueryBroadcastExec | Plan to collect and transform the broadcast key values | Y | Y | S | S | S | S | S | S | S | S | S | S | NS | NS | NS | S | NS | S | NS | NS |
| TakeOrderedAndProjectExec | Take the first limit elements as defined by the sortOrder, and do projection if needed | Y | Y | S | S | S | S | S | S | S | S | S | S | NS | NS | NS | S | NS | S | NS | NS |
| CustomShuffleReaderExec | A wrapper of shuffle query stage | N | N | | | | | | | | | | | | | | | | | | |
| InMemoryTableScanExec | Implementation of InMemory Table Scan | Y | Y | | | | | | | | | | | | | | | | | | |
| BroadcastNestedLoopJoinExec | Implementation of join using brute force. Full outer joins and joins where the broadcast side matches the join side (e.g.: LeftOuter with left broadcast) are not supported | BroadcastNestedLoopJoinExecTransformer | NestedLoopJoinNode | S | S | S | S | S | S | S | S | S | S | NS | NS | NS | S | NS | NS | NS | NS |
| AggregateInPandasExec | The backend for an Aggregation Pandas UDF; accelerates the data transfer between the Java process and the Python process | N | N | | | | | | | | | | | | | | | | | | |
| ArrowEvalPythonExec | The backend for Scalar Pandas UDFs; accelerates the data transfer between the Java process and the Python process | N | N | | | | | | | | | | | | | | | | | | |
| FlatMapGroupsInPandasExec | The backend for Flat Map Groups Pandas UDF; accelerates the data transfer between the Java process and the Python process | N | N | | | | | | | | | | | | | | | | | | |
| MapInPandasExec | The backend for Map Pandas Iterator UDF; accelerates the data transfer between the Java process and the Python process | N | N | | | | | | | | | | | | | | | | | | |
| WindowInPandasExec | The backend for Window Aggregation Pandas UDF; accelerates the data transfer between the Java process and the Python process | N | N | | | | | | | | | | | | | | | | | | |
| HiveTableScanExec | The Hive table scan operator. Column and partition pruning are both handled | Y | Y | | | | | | | | | | | | | | | | | | |
| InsertIntoHiveTable | Command for writing data out to a Hive table | Y | Y | | | | | | | | | | | | | | | | | | |
| Velox2Row | Convert Velox format to Row format | Y | Y | S | S | S | S | S | S | S | S | NS | S | NS | NS | NS | S | S | NS | NS | NS |
| Velox2Arrow | Convert Velox format to Arrow format | Y | Y | S | S | S | S | S | S | S | S | NS | S | S | S | S | S | NS | S | NS | NS |
| WindowGroupLimitExec | Optimizes a window with a rank-like function that has a filter on it | Y | Y | S | S | S | S | S | S | S | S | NS | S | S | S | S | S | NS | S | NS | NS |
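
To check which operators were actually offloaded in a given query, one option is to inspect the physical plan and look
for the `*Transformer` node names from the table above. A minimal sketch, assuming an active `spark` session with
Gluten enabled (the exact plan output varies by Spark and Gluten version):

```scala
// Offloaded operators appear in the physical plan under their Gluten names,
// e.g. FilterExecTransformer and ProjectExecTransformer instead of
// FilterExec and ProjectExec.
val df = spark.range(100).filter("id > 10").selectExpr("id * 2 AS doubled")
df.explain()
```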

## Function Support Status

Spark categorizes built-in functions into four types: Scalar Functions, Aggregate Functions,
Window Functions, and Generator Functions.
In Gluten, the function support status is generated automatically by a script and maintained in separate files.
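
For reference, one example function from each category (the functions here are arbitrary illustrations, assuming an
active `spark` session):

```scala
spark.sql("SELECT abs(-1)").show()                              // scalar function
spark.sql("SELECT sum(c) FROM VALUES (1), (2) AS t(c)").show()  // aggregate function
spark.sql("SELECT c, rank() OVER (ORDER BY c) FROM VALUES (1), (2) AS t(c)").show() // window function
spark.sql("SELECT explode(array(1, 2))").show()                 // generator function
```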

When running the script, the `--spark_home` argument should point to one of the following:
* A directory containing the Spark source code for the latest Spark version supported by Gluten; the Spark
  project must be built from source.
* A directory prepared by the `install_spark_resources.sh` script, which fetches the necessary resource files:
```shell
# Define a directory to use for the Spark files and the latest Spark version
export spark_dir=/tmp/spark
export spark_version=3.5

# Run the install_spark_resources.sh script
.github/workflows/util/install_spark_resources.sh ${spark_version} ${spark_dir}
```
After running `install_spark_resources.sh`, the `--spark_home` argument for the document generation script will be
something like `--spark_home=${spark_dir}/shims/spark35/spark_home`.

Use the following command to generate and update the support status:
```shell
python3 tools/scripts/gen-function-support-docs.py --spark_home=/path/to/spark_source_code
```

Please check the links below for the detailed support status of each category:

[Scalar Functions Support Status](./velox-backend-scalar-function-support.md)

[Aggregate Functions Support Status](./velox-backend-aggregate-function-support.md)

[Window Functions Support Status](./velox-backend-window-function-support.md)

[Generator Functions Support Status](./velox-backend-generator-function-support.md)