Type | Version |
---|---|
Flink | 1.19.2 |
OS | Ubuntu 20.04/22.04, CentOS 7/8 |
JDK | OpenJDK 11 / JDK 17 |
Scala | 2.12 |
Currently, with a static build, the Gluten+Flink+Velox backend supports all Linux distributions but has only been tested on Ubuntu 20.04. With a dynamic build, the Gluten+Velox backend supports Ubuntu 20.04, Ubuntu 22.04, CentOS 7, CentOS 8, and their variants.
Currently, the officially supported Flink version is 1.19.2.
Set the JAVA_HOME environment variable. Gluten currently supports Java 11 and Java 17.
For x86_64:

    ## make sure jdk11 is used
    export JAVA_HOME=/usr/lib/jvm/java-11-openjdk-amd64
    export PATH=$JAVA_HOME/bin:$PATH

For aarch64:

    ## make sure jdk11 is used
    export JAVA_HOME=/usr/lib/jvm/java-11-openjdk-arm64
    export PATH=$JAVA_HOME/bin:$PATH
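The two cases above can be folded into one snippet that picks the path from the machine architecture. This is a minimal sketch: `java_home_for_arch` is a hypothetical helper name, and the paths are simply the two JDK 11 locations shown above, which may differ on your distribution.

```shell
# Map an architecture name (as printed by `uname -m`) to the JDK 11 path
# used in this guide. Hypothetical helper; adjust the paths for your system.
java_home_for_arch() {
  case "$1" in
    x86_64)  echo "/usr/lib/jvm/java-11-openjdk-amd64" ;;
    aarch64) echo "/usr/lib/jvm/java-11-openjdk-arm64" ;;
    *)       echo "unsupported architecture: $1" >&2; return 1 ;;
  esac
}

# Only touch JAVA_HOME and PATH when the architecture is recognized.
if detected="$(java_home_for_arch "$(uname -m)")"; then
  export JAVA_HOME="$detected"
  export PATH="$JAVA_HOME/bin:$PATH"
fi
```

After sourcing it, `java -version` should report version 11 if that JDK is installed at the assumed path.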
Get Velox4j
Gluten for Flink depends on Velox4j to call Velox. This is an experimental feature. You need to fetch the Velox4j code and compile it first.
Because some features have not yet been merged upstream, you need to use the following fork for now.

    ## fetch velox4j code
    git clone -b gluten-0530 https://github.com/bigo-sg/velox4j.git
    cd velox4j
    git reset --hard 0180528e9b98fad22bc9da8a3864d2929ef73eec
    mvn clean install -DskipTests -Dgpg.skip -Dspotless.skip=true
Get Gluten

    ## config maven, like proxy in ~/.m2/settings.xml

    ## fetch gluten code
    git clone https://github.com/apache/incubator-gluten.git

    cd /path/to/gluten/gluten-flink
    mvn clean package -Dmaven.test.skip=true
Get Nexmark

    git clone https://github.com/nexmark/nexmark.git
    cd nexmark
    mvn clean install -DskipTests
Run Tests

    cd /path/to/gluten/gluten-flink
    mvn test
Submit the test script with flink run. You can use StreamSQLExample as an example.
After deploying the Flink binaries, configure the paths of the gluten-flink jars and their dependency jars for Flink to use, as follows:

    # Note: first set your own project home directories. You may set them
    # manually in your .bash_profile so that they take effect automatically.
    export VELOX4J_HOME=
    export GLUTEN_FLINK_HOME=
    export FLINK_HOME=

    cd $FLINK_HOME
    mkdir -p gluten_lib
    ln -s $VELOX4J_HOME/target/velox4j-0.1.0-SNAPSHOT.jar $FLINK_HOME/gluten_lib/velox4j-0.1.0-SNAPSHOT.jar
    ln -s $GLUTEN_FLINK_HOME/runtime/target/gluten-flink-runtime-1.6.0-SNAPSHOT.jar $FLINK_HOME/gluten_lib/gluten-flink-runtime-1.6.0.jar
    ln -s $GLUTEN_FLINK_HOME/loader/target/gluten-flink-loader-1.6.0-SNAPSHOT.jar $FLINK_HOME/gluten_lib/gluten-flink-loader-1.6.0.jar
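Because `ln -s` succeeds even when its target does not exist, it can help to confirm that every link in gluten_lib resolves before starting Flink. The following is a minimal sketch; `check_gluten_lib` is a hypothetical helper and is not part of Gluten or Flink.

```shell
# Report every *.jar entry in a directory that is missing or is a dangling
# symlink. Hypothetical helper for illustration only.
check_gluten_lib() {
  dir="$1"
  status=0
  for jar in "$dir"/*.jar; do
    # [ -e ] follows symlinks, so it fails when the link target is absent.
    # If the directory has no jars, the unexpanded pattern is reported too.
    if [ ! -e "$jar" ]; then
      echo "broken or missing: $jar" >&2
      status=1
    fi
  done
  return "$status"
}
```

A typical call would be `check_gluten_lib "$FLINK_HOME/gluten_lib"`; a non-zero exit status suggests that one of the build steps above did not produce the expected jar.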
Gluten classes must be loaded before the Flink libraries. To do this, modify the constructFlinkClassPath function in $FLINK_HOME/bin/config.sh like this:

    GLUTEN_JAR="$FLINK_HOME/gluten_lib/gluten-flink-loader-1.6.0.jar:$FLINK_HOME/gluten_lib/velox4j-0.1.0-SNAPSHOT.jar:$FLINK_HOME/gluten_lib/gluten-flink-runtime-1.6.0.jar:"
    echo "$GLUTEN_JAR""$FLINK_CLASSPATH""$FLINK_DIST"
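The ordering matters because the JVM searches classpath entries from left to right, so jars listed first take precedence. As a quick sanity check, the sketch below tests whether a classpath string begins with the Gluten loader jar. `classpath_starts_with_gluten` is a hypothetical helper, and the example paths in the usage note are illustrative only.

```shell
# Succeed when the first classpath entry is the Gluten loader jar.
# Hypothetical check; not a Flink or Gluten utility.
classpath_starts_with_gluten() {
  first_entry="${1%%:*}"   # keep only the text before the first ':'
  case "$first_entry" in
    */gluten-flink-loader-*.jar) return 0 ;;
    *) return 1 ;;
  esac
}
```

For example, `classpath_starts_with_gluten "/opt/flink/gluten_lib/gluten-flink-loader-1.6.0.jar:/opt/flink/lib/some-other.jar"` returns 0, while the same entries in reverse order return 1.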
Then go to the Flink home directory and use the following commands to submit the example job:

    cd $FLINK_HOME
    bin/start-cluster.sh
    bin/flink run examples/table/StreamSQLExample.jar
The result is written to log/flink-*-taskexecutor-*.out, and an operator named gluten-cal appears in the web frontend of your Flink job.
Note: this example currently causes a NullPointerException (NPE) until issue-10315 is resolved.
In another example, all operators are executed natively. You can use data-generator.sql under the dev directory:

    bin/sql-client.sh -f data-generator.sql
TODO
We are working on supporting the Nexmark benchmark for Flink. Currently, q0 is supported.
Results for q0 show that running with Gluten is roughly 2.9 times faster than Flink alone (161.428 s versus 462.069 s for 100,000,000 events).
Result using Gluten (will support TPS metric soon):

    -------------------------------- Nexmark Results --------------------------------

    +------+-----------------+--------+----------+-----------------+--------------+-----------------+
    | Query| Events Num      | Cores  | Time(s)  | Cores * Time(s) | Throughput   | Throughput/Cores|
    +------+-----------------+--------+----------+-----------------+--------------+-----------------+
    |q0    |100,000,000      |NaN     |161.428   |NaN              |619.47 K/s    |0/s              |
    |Total |100,000,000      |NaN     |161.428   |NaN              |619.47 K/s    |0/s              |
    +------+-----------------+--------+----------+-----------------+--------------+-----------------+
Result using Flink:

    -------------------------------- Nexmark Results --------------------------------

    +------+-----------------+--------+----------+-----------------+--------------+-----------------+
    | Query| Events Num      | Cores  | Time(s)  | Cores * Time(s) | Throughput   | Throughput/Cores|
    +------+-----------------+--------+----------+-----------------+--------------+-----------------+
    |q0    |100,000,000      |1.21    |462.069   |558.210          |216.42 K/s    |179.14/s         |
    |Total |100,000,000      |1.208   |462.069   |558.210          |216.42 K/s    |179.14/s         |
    +------+-----------------+--------+----------+-----------------+--------------+-----------------+
We are still optimizing it.
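The speedup can be checked directly from the Time(s) values in the two result tables. The snippet below is a small sketch of that arithmetic; `ratio` is a hypothetical helper name, and the numbers are copied from the tables above.

```shell
# ratio A B prints A / B rounded to two decimals (illustrative helper).
ratio() {
  awk -v a="$1" -v b="$2" 'BEGIN { printf "%.2f\n", a / b }'
}

flink_time=462.069     # Time(s) for q0 with Flink
gluten_time=161.428    # Time(s) for q0 with Gluten

# prints: speedup on q0: 2.86x
echo "speedup on q0: $(ratio "$flink_time" "$gluten_time")x"
```

The same ratio follows from the Throughput column (619.47 K/s versus 216.42 K/s), since both runs process the same number of events.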
Neither Gluten for Flink nor Velox4j currently provides a bundled jar that includes all of its dependencies, so you may have to add these jars yourself. They may include guava-33.4.0-jre.jar, jackson-core-2.18.0.jar, jackson-databind-2.18.0.jar, jackson-datatype-jdk8-2.18.0.jar, jackson-annotations-2.18.0.jar, arrow-memory-core-18.1.0.jar, arrow-memory-unsafe-18.1.0.jar, arrow-vector-18.1.0.jar, flatbuffers-java-24.3.25.jar, arrow-format-18.1.0.jar, and arrow-c-data-18.1.0.jar. We will supply bundled jars soon.
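To see which of these jars are still absent, a short loop over the list can help. This is a sketch under assumptions: `missing_jars` is a hypothetical helper, the directory to scan is whichever one you chose for the extra jars (this guide does not prescribe one), and the list may not be exhaustive.

```shell
# Jar names copied from the list above; the list may be incomplete.
REQUIRED_JARS="
guava-33.4.0-jre.jar
jackson-core-2.18.0.jar
jackson-databind-2.18.0.jar
jackson-datatype-jdk8-2.18.0.jar
jackson-annotations-2.18.0.jar
arrow-memory-core-18.1.0.jar
arrow-memory-unsafe-18.1.0.jar
arrow-vector-18.1.0.jar
flatbuffers-java-24.3.25.jar
arrow-format-18.1.0.jar
arrow-c-data-18.1.0.jar
"

# Print the name of every required jar not found in the given directory.
# Hypothetical helper; the directory argument is your own choice.
missing_jars() {
  dir="$1"
  for jar in $REQUIRED_JARS; do
    [ -e "$dir/$jar" ] || echo "$jar"
  done
}
```

Calling `missing_jars /path/to/your/jar/dir` prints nothing when all listed jars are present.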