commit | 36cb159ee441e3815ec865d8221e6f3e5465c9b4 | [log] [tgz] |
---|---|---|
author | Zhang Li <richox@qq.com> | Mon Apr 28 15:41:26 2025 +0800 |
committer | GitHub <noreply@github.com> | Mon Apr 28 15:41:26 2025 +0800 |
tree | e52bc978fdd4cc17ca74d65afa46e75429f7d694 | |
parent | 7ab0272ce59e70b4d62d426193ad82a264fbe610 [diff] |
release version v5.0.0 (#973)
Co-authored-by: zhangli20 <zhangli20@kuaishou.com>
The Blaze accelerator for Apache Spark leverages native vectorized execution to accelerate query processing. It combines the power of the Apache DataFusion library and the scale of the Spark distributed computing framework.
Blaze takes a fully optimized physical plan from Spark, maps it into DataFusion's execution plan, and performs native plan computation in Spark executors.
Blaze is composed of the following high-level components:
Based on the inherent well-defined extensibility of DataFusion, Blaze can be easily extended to support:
We encourage you to extend DataFusion's capabilities directly and then add support in Blaze with simple modifications to plan-serde and extension translation.
To build Blaze, please follow the steps below:
The native execution library is written in Rust, so you need to install Rust (nightly) before compiling. We recommend using rustup.
Ensure protoc is available on your PATH. protobuf can be installed via your Linux system package manager (or Homebrew on macOS), or downloaded and built manually from https://github.com/protocolbuffers/protobuf/releases .
Blaze has been well tested on JDK 8 and Maven 3.5, and should work fine with higher versions.
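Before starting a build, it can help to confirm that the tools named above are actually on PATH. The following is a minimal sketch, not part of Blaze: the `missing_tools` helper is a hypothetical name, and the sketch checks only for the presence of `rustup`, `cargo`, `protoc`, `java`, and `mvn`, not their versions.

```shell
#!/bin/sh
# Hypothetical helper (not shipped with Blaze): print the name of every
# command in the argument list that cannot be found on PATH.
missing_tools() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
    done
}

# Tools named in the prerequisites: rustup/cargo, protoc, java, mvn.
missing=$(missing_tools rustup cargo protoc java mvn)
if [ -n "$missing" ]; then
    # $missing is left unquoted on purpose so the names print on one line.
    echo "missing build tools:" $missing >&2
else
    echo "all build tools found"
fi
```

The check does not verify that the installed Rust toolchain is a nightly one; that still has to be confirmed separately.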
git clone git@github.com:kwai/blaze.git
cd blaze
Specify the shims package for the Spark version you want to run on.
The following shims are currently supported: spark-3.0, spark-3.1, spark-3.2, spark-3.3, spark-3.4, and spark-3.5.
You can build Blaze either in pre mode for debugging or in release mode to unlock its full potential.
SHIM=spark-3.3 # or spark-3.0/spark-3.1/spark-3.2/spark-3.3/spark-3.4/spark-3.5
MODE=release # or pre
mvn clean package -P"${SHIM}" -P"${MODE}"
To skip building the native library (when it has already been built; you can check for it in native-engine/_build/${MODE}), add -DskipBuildNative:
SHIM=spark-3.3 # or spark-3.0/spark-3.1/spark-3.2/spark-3.3/spark-3.4/spark-3.5
MODE=release # or pre
mvn clean package -P"${SHIM}" -P"${MODE}" -DskipBuildNative
After the build is finished, a fat JAR package that contains all the dependencies will be generated in the target directory.
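A mistyped SHIM or MODE value only surfaces once Maven tries to resolve the profile, so a small wrapper can validate both first. This is a sketch under stated assumptions: `blaze_build_cmd` and its `skip-native` argument are hypothetical names, and the accepted values are simply the ones listed in the build commands above. It prints the Maven command rather than running it.

```shell
#!/bin/sh
# Hypothetical helper (not part of Blaze): print the Maven build command for
# a given shim and mode, or fail if either value is not one listed above.
# An optional third argument "skip-native" appends -DskipBuildNative.
blaze_build_cmd() {
    shim=$1
    mode=$2
    skip_native=${3:-}
    case "$shim" in
        spark-3.0|spark-3.1|spark-3.2|spark-3.3|spark-3.4|spark-3.5) ;;
        *) echo "unsupported SHIM: $shim" >&2; return 1 ;;
    esac
    case "$mode" in
        release|pre) ;;
        *) echo "unsupported MODE: $mode" >&2; return 1 ;;
    esac
    cmd="mvn clean package -P$shim -P$mode"
    if [ "$skip_native" = "skip-native" ]; then
        cmd="$cmd -DskipBuildNative"
    fi
    printf '%s\n' "$cmd"
}

# Dry run: print the command instead of executing it.
blaze_build_cmd "${SHIM:-spark-3.3}" "${MODE:-release}"
```

To run the build instead of printing it, the output can be passed to `sh -c`, for example `sh -c "$(blaze_build_cmd spark-3.3 release)"`.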
You can use the following command to build a centos-7 compatible release:
SHIM=spark-3.3 MODE=release ./release-docker.sh
This section describes how to submit and configure a Spark Job with Blaze support.
Move the Blaze JAR package to the Spark client classpath (normally spark-xx.xx.xx/jars/).
Add the following settings to the Spark configuration in spark-xx.xx.xx/conf/spark-defaults.conf:
spark.blaze.enable true
spark.sql.extensions org.apache.spark.sql.blaze.BlazeSparkSessionExtension
spark.shuffle.manager org.apache.spark.sql.execution.blaze.shuffle.BlazeShuffleManager
spark.memory.offHeap.enabled false

# suggested executor memory configuration
spark.executor.memory 4g
spark.executor.memoryOverhead 4096
spark-sql -f tpcds/q01.sql
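For a one-off trial it may be more convenient to pass the same settings on the command line, since `spark-sql` accepts `--conf key=value` options. The sketch below assembles such a command from the keys and values listed above; `blaze_spark_sql_cmd` is a hypothetical helper name, and the function only prints the command (a dry run) rather than launching Spark.

```shell
#!/bin/sh
# Hypothetical helper (not part of Blaze): print a spark-sql command line
# that passes the Blaze settings listed above as --conf flags.
blaze_spark_sql_cmd() {
    query_file=$1
    # Reuse the function's positional parameters as an argument list.
    set -- spark-sql
    for kv in \
        spark.blaze.enable=true \
        spark.sql.extensions=org.apache.spark.sql.blaze.BlazeSparkSessionExtension \
        spark.shuffle.manager=org.apache.spark.sql.execution.blaze.shuffle.BlazeShuffleManager \
        spark.memory.offHeap.enabled=false \
        spark.executor.memory=4g \
        spark.executor.memoryOverhead=4096
    do
        set -- "$@" --conf "$kv"
    done
    set -- "$@" -f "$query_file"
    printf '%s\n' "$*"
}

# Dry run: print the full command for the example query.
blaze_spark_sql_cmd tpcds/q01.sql
```

Settings passed with `--conf` take precedence over those in the configuration file, so this form is also a quick way to compare a run with and without Blaze without editing the file.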
Blaze now supports Celeborn integration. Use the following configuration to enable shuffling with Celeborn:
# change celeborn endpoint and storage directory to the correct location
spark.shuffle.manager org.apache.spark.sql.execution.blaze.shuffle.celeborn.BlazeCelebornShuffleManager
spark.serializer org.apache.spark.serializer.KryoSerializer
spark.celeborn.master.endpoints localhost:9097
spark.celeborn.client.spark.shuffle.writer hash
spark.celeborn.client.push.replicate.enabled false
spark.celeborn.storage.availableTypes HDFS
spark.celeborn.storage.hdfs.dir hdfs:///home/celeborn
spark.sql.adaptive.localShuffleReader.enabled false
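Both the basic configuration and the Celeborn configuration set spark.shuffle.manager, so appending the Celeborn settings to an existing file can leave two conflicting entries for that key. The sketch below reports keys that appear more than once. `duplicate_conf_keys` is a hypothetical helper, and it assumes the whitespace-separated `key value` layout used in this document (not `key=value` lines).

```shell
#!/bin/sh
# Hypothetical helper (not part of Blaze or Spark): print every key that
# appears more than once in a "key value" style configuration file.
duplicate_conf_keys() {
    awk '
        /^[ \t]*#/ { next }                 # skip comment lines
        NF == 0    { next }                 # skip blank lines
        { if (++seen[$1] == 2) print $1 }   # report a key on its second occurrence
    ' "$1"
}

# Demonstration on a temporary file that mixes both shuffle managers.
conf=$(mktemp)
cat > "$conf" <<'EOF'
spark.shuffle.manager org.apache.spark.sql.execution.blaze.shuffle.BlazeShuffleManager
# celeborn settings appended later
spark.shuffle.manager org.apache.spark.sql.execution.blaze.shuffle.celeborn.BlazeCelebornShuffleManager
spark.serializer org.apache.spark.serializer.KryoSerializer
EOF
duplicate_conf_keys "$conf"   # prints: spark.shuffle.manager
rm -f "$conf"
```

If a key is reported, keep only the entry you intend to use, for example the Celeborn shuffle manager when Celeborn is enabled.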
Check the TPC-H Benchmark Results. The latest benchmark result shows that Blaze saved more than 50% of the query time on the TPC-H 1TB dataset compared with vanilla Spark 3.5.
Stay tuned and join us for more upcoming thrilling numbers.
TPC-H Query time:
We also encourage you to benchmark Blaze and share the results with us. 🤗
We're using Discussions to connect with other members of our community. We hope that you:
Blaze is licensed under the Apache 2.0 License. A copy of the license can be found here.