Apache Spark is a mature and stable project that has been under continuous development for many years. It is one of the most widely used frameworks for scaling out the processing of petabyte-scale datasets. Over time, the Spark community has had to address significant performance challenges through a variety of optimizations. A major milestone came with Spark 2.0, where Whole-Stage Code Generation replaced the Volcano Model and delivered up to a 2× speedup. Since then, most improvements have focused on the query-plan level, while the performance of individual operators has largely plateaued.
In recent years, several native SQL engines have been developed, such as ClickHouse and Velox. With features like native execution, columnar data formats, and vectorized data processing, these engines can outperform Spark’s JVM-based SQL engine. However, they currently don't directly support Spark SQL execution.
“Gluten” is Latin for “glue”. The main goal of the Gluten project is to glue native engines to Spark SQL. Thus, we can benefit from the high performance of native engines and the high scalability enabled by the Spark ecosystem.
The basic design principle is to reuse Spark's control flow while offloading compute-intensive data processing to the native side.
Gluten's target users include anyone who wants to fundamentally accelerate Spark SQL. As a plugin to Spark, Gluten requires no changes to the DataFrame API or SQL queries; users only need to configure it correctly.
The overview chart is shown below. Substrait provides a well-defined, cross-language specification for data compute operations. Spark’s physical plan is transformed into a Substrait plan, which is then passed to the native side through a JNI call. On the native side, a chain of native operators is constructed and offloaded to the native engine. Gluten returns the results as a ColumnarBatch, and Spark’s Columnar API (introduced in Spark 3.0) is used during execution. Gluten adopts the Apache Arrow data format as its underlying representation.
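For readers unfamiliar with Spark's Columnar API, the sketch below illustrates the extension point this kind of integration builds on: a ColumnarRule registered through SparkSessionExtensions can rewrite the physical plan before columnar transitions are inserted. The class names here are illustrative placeholders, not Gluten's actual classes; they only show the Spark 3.x mechanism, under the assumption that the real plugin performs the Substrait translation at this step.

```scala
import org.apache.spark.sql.SparkSessionExtensions
import org.apache.spark.sql.catalyst.rules.Rule
import org.apache.spark.sql.execution.{ColumnarRule, SparkPlan}

// Illustrative only: a ColumnarRule can rewrite the physical plan before
// Spark inserts row/columnar transitions.
class ExampleColumnarRule extends ColumnarRule {
  override def preColumnarTransitions: Rule[SparkPlan] = new Rule[SparkPlan] {
    override def apply(plan: SparkPlan): SparkPlan = {
      // A native integration would translate supported operators here
      // (e.g. into a Substrait plan) and replace them with operators that
      // return ColumnarBatch results produced on the native side.
      plan
    }
  }
}

// Registered via spark.sql.extensions or by a Spark plugin at startup.
class ExampleExtensions extends (SparkSessionExtensions => Unit) {
  override def apply(extensions: SparkSessionExtensions): Unit =
    extensions.injectColumnar(_ => new ExampleColumnarRule)
}
```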
Gluten's key components:
Below is a basic configuration to enable Gluten in Spark.
export GLUTEN_JAR=/PATH/TO/GLUTEN_JAR
spark-shell \
  --master yarn --deploy-mode client \
  --conf spark.plugins=org.apache.gluten.GlutenPlugin \
  --conf spark.memory.offHeap.enabled=true \
  --conf spark.memory.offHeap.size=20g \
  --conf spark.driver.extraClassPath=${GLUTEN_JAR} \
  --conf spark.executor.extraClassPath=${GLUTEN_JAR} \
  --conf spark.shuffle.manager=org.apache.spark.shuffle.sort.ColumnarShuffleManager \
  ...
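The same settings can also be applied programmatically. Below is a minimal sketch using the standard SparkSession builder; the off-heap size is a placeholder, and the classpath settings from the spark-shell example above must still be supplied at launch (e.g. via extraClassPath or --jars), since they cannot take effect after the JVM has started.

```scala
import org.apache.spark.sql.SparkSession

// Sketch of the Gluten-related settings applied via the SparkSession builder.
// Values such as the off-heap size are placeholders to adjust per deployment.
val spark = SparkSession.builder()
  .appName("gluten-example")
  .config("spark.plugins", "org.apache.gluten.GlutenPlugin")
  .config("spark.memory.offHeap.enabled", "true")
  .config("spark.memory.offHeap.size", "20g")
  .config("spark.shuffle.manager",
    "org.apache.spark.shuffle.sort.ColumnarShuffleManager")
  .getOrCreate()

// Existing DataFrame/SQL code runs unchanged; inspecting the physical plan
// shows whether operators were offloaded to the native backend.
spark.sql("SELECT count(*) FROM range(1000000)").explain()
```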
There are two ways to acquire the Gluten JAR for the above configuration.
Please download the tar package here, then extract the Gluten JAR from it. Additionally, Gluten provides nightly builds based on the main branch for early testing. The nightly build JARs are available at Apache Gluten Nightlies. They have been verified on CentOS 7/8/9 and Ubuntu 20.04/22.04.
For Velox backend, please refer to Velox.md and build-guide.md.
For ClickHouse backend, please refer to ClickHouse.md.
After the build, the Gluten JAR will be generated under /PATH/TO/GLUTEN/package/target/.
Common configurations used by Gluten are listed in Configuration.md. Velox specific configurations are listed in velox-configuration.md.
The Gluten Velox backend honors some Spark configurations, ignores others, and many are transparent to it. See velox-spark-configuration.md for details, and velox-parquet-write-configuration.md for Parquet write configurations.
Welcome to contribute to the Gluten project! See CONTRIBUTING.md for guidelines on how to make contributions.
Gluten successfully became an Apache Incubator project in March 2024. Here are several ways to connect with the community.
You are welcome to report issues or start discussions on GitHub. Please search the existing issue list before creating a new one to avoid duplicates.
For any technical discussions, please email dev@gluten.apache.org. You can browse the archives to view past discussions, or subscribe to the mailing list to receive updates.
Request an invitation to the ASF Slack workspace via this page. Once invited, you can join the incubator-gluten channel.
The ASF Slack login entry: https://the-asf.slack.com/.
Please contact weitingchen at apache.org or zhangzc at apache.org to request an invitation to the WeChat group. It is for Chinese-language communication.
TPC-H is used to evaluate Gluten's performance. Please note that the results below do not reflect the latest performance.
The Gluten Velox backend demonstrated an overall speedup of 2.71x, with up to a 14.53x speedup observed in a single query.
Tested in June 2023 on a single node with 2 TB of data, using Spark 3.3.2 as the baseline and Gluten integrated into the same Spark version.
The ClickHouse backend demonstrated an average speedup of 2.12x, with up to a 3.48x speedup observed in a single query.
Test environment: an 8-node AWS cluster with 1 TB of data, using Spark 3.1.1 as the baseline and Gluten integrated into the same Spark version.
The Qualification Tool is a utility to analyze Spark event log files and assess the compatibility and performance of SQL workloads with Gluten. This tool helps users understand how their workloads can benefit from Gluten.
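The tool consumes standard Spark event logs, so event logging must be enabled on the workloads to be analyzed. A minimal sketch is shown below; the log directory is a placeholder and these are standard Spark settings rather than anything specific to the Qualification Tool.

```scala
import org.apache.spark.sql.SparkSession

// Minimal sketch: enable standard Spark event logging so that event log
// files are produced for later analysis. The directory is a placeholder.
val spark = SparkSession.builder()
  .appName("eventlog-example")
  .config("spark.eventLog.enabled", "true")
  .config("spark.eventLog.dir", "hdfs:///tmp/spark-events")
  .getOrCreate()
```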
Gluten is licensed under the Apache License 2.0.
Gluten was initiated by Intel and Kyligence in 2022. Several other companies are also actively contributing to its development, including BIGO, Meituan, Alibaba Cloud, NetEase, Baidu, Microsoft, IBM, Google, etc.
* LEGAL NOTICE: Your use of this software and any required dependent software (the “Software Package”) is subject to the terms and conditions of the software license agreements for the Software Package, which may also include notices, disclaimers, or license terms for third party or open source software included in or with the Software Package, and your use indicates your acceptance of all such terms. Please refer to the “TPP.txt” or other similarly-named text file included with the Software Package for additional details.