Run the following command from the root of this repository to build the Comet Docker image, or use a published Docker image.
```shell
docker build -t apache/datafusion-comet -f kube/Dockerfile .
```
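If you build the image locally and test against a local cluster, the image must be visible to the cluster's container runtime rather than pulled from a registry. As a minimal sketch, assuming a local kind cluster (kind is an assumption here, not a Comet requirement):

```shell
# Load the locally built image into a kind cluster so pods can use it
# without pushing to a registry (assumes a kind cluster named "kind")
kind load docker-image apache/datafusion-comet
```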
The exact syntax will vary depending on the Kubernetes distribution, but a sketch of an example spark-submit command is shown below.
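This sketch assumes a reachable Kubernetes API server at `k8s-master:8443` (a placeholder) and reuses the image, jar path, and Comet settings shown elsewhere in this guide; adjust the master URL, namespace, and service account for your cluster:

```shell
# A sketch of submitting the SparkPi example directly to Kubernetes
# with the Comet plugin enabled; k8s-master:8443 is a placeholder
$SPARK_HOME/bin/spark-submit \
  --master k8s://https://k8s-master:8443 \
  --deploy-mode cluster \
  --name spark-pi \
  --class org.apache.spark.examples.SparkPi \
  --conf spark.executor.instances=1 \
  --conf spark.kubernetes.container.image=apache/datafusion-comet:0.9.1-spark3.5.5-scala2.12-java11 \
  --conf spark.plugins=org.apache.spark.CometPlugin \
  --conf spark.driver.extraClassPath=/opt/spark/jars/comet-spark-spark3.5_2.12-0.9.1.jar \
  --conf spark.executor.extraClassPath=/opt/spark/jars/comet-spark-spark3.5_2.12-0.9.1.jar \
  --conf spark.comet.enabled=true \
  --conf spark.shuffle.manager=org.apache.spark.sql.comet.execution.shuffle.CometShuffleManager \
  local:///opt/spark/examples/jars/spark-examples_2.12-3.5.5.jar
```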
Install the Kubeflow Spark operator for Kubernetes using Helm:
```shell
# Add the Helm repository
helm repo add spark-operator https://kubeflow.github.io/spark-operator
helm repo update

# Install the operator into the spark-operator namespace and wait for deployments to be ready
helm install spark-operator spark-operator/spark-operator --namespace spark-operator --create-namespace --wait
```
Check that the operator is deployed:
```shell
$ helm status --namespace spark-operator spark-operator
NAME: spark-operator
NAMESPACE: spark-operator
STATUS: deployed
REVISION: 1
TEST SUITE: None
```
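As an optional extra check, the operator's controller pod should be running in the same namespace:

```shell
# The operator's controller pod should be in the Running state
kubectl get pods -n spark-operator
```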
Create an example Spark application file named `spark-pi.yaml`:
```yaml
apiVersion: sparkoperator.k8s.io/v1beta2
kind: SparkApplication
metadata:
  name: spark-pi
  namespace: default
spec:
  type: Scala
  mode: cluster
  image: apache/datafusion-comet:0.9.1-spark3.5.5-scala2.12-java11
  imagePullPolicy: IfNotPresent
  mainClass: org.apache.spark.examples.SparkPi
  mainApplicationFile: local:///opt/spark/examples/jars/spark-examples_2.12-3.5.5.jar
  sparkConf:
    "spark.executor.extraClassPath": "/opt/spark/jars/comet-spark-spark3.5_2.12-0.9.1.jar"
    "spark.driver.extraClassPath": "/opt/spark/jars/comet-spark-spark3.5_2.12-0.9.1.jar"
    "spark.plugins": "org.apache.spark.CometPlugin"
    "spark.comet.enabled": "true"
    "spark.comet.exec.enabled": "true"
    "spark.comet.cast.allowIncompatible": "true"
    "spark.comet.exec.shuffle.enabled": "true"
    "spark.comet.exec.shuffle.mode": "auto"
    "spark.shuffle.manager": "org.apache.spark.sql.comet.execution.shuffle.CometShuffleManager"
  sparkVersion: 3.5.5
  driver:
    labels:
      version: 3.5.5
    cores: 1
    coreLimit: 1200m
    memory: 512m
    serviceAccount: spark-operator-spark
  executor:
    labels:
      version: 3.5.5
    instances: 1
    cores: 1
    coreLimit: 1200m
    memory: 512m
```
Refer to the published Comet builds for other available image tags.
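The manifest above runs the driver under the `spark-operator-spark` service account. Whether the Helm chart created that account in your job namespace depends on the chart values, so it is worth a quick check before submitting (a sketch; the account name comes from the manifest above):

```shell
# Verify the service account referenced by spark-pi.yaml exists in the
# namespace where the application will run ("default" in this example)
kubectl get serviceaccount spark-operator-spark -n default
```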
Run the Apache Spark application with Comet enabled:
```shell
$ kubectl apply -f spark-pi.yaml
sparkapplication.sparkoperator.k8s.io/spark-pi created
```
Check the application status:
```shell
$ kubectl get sparkapp spark-pi
NAME       STATUS    ATTEMPTS   START                  FINISH       AGE
spark-pi   RUNNING   1          2025-03-18T21:19:48Z   <no value>   65s
```
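To follow the status without polling manually, kubectl's watch flag works here as well:

```shell
# Watch status transitions (e.g. SUBMITTED -> RUNNING -> COMPLETED);
# interrupt with Ctrl-C once the application finishes
kubectl get sparkapp spark-pi -w
```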
To see more runtime details:
```shell
$ kubectl describe sparkapplication spark-pi
...
Events:
  Type    Reason                     Age    From                          Message
  ----    ------                     ----   ----                          -------
  Normal  SparkApplicationSubmitted  8m15s  spark-application-controller  SparkApplication spark-pi was submitted successfully
  Normal  SparkDriverRunning         7m18s  spark-application-controller  Driver spark-pi-driver is running
  Normal  SparkExecutorPending       7m11s  spark-application-controller  Executor [spark-pi-68732195ab217303-exec-1] is pending
  Normal  SparkExecutorRunning       7m10s  spark-application-controller  Executor [spark-pi-68732195ab217303-exec-1] is running
  Normal  SparkExecutorCompleted     7m5s   spark-application-controller  Executor [spark-pi-68732195ab217303-exec-1] completed
  Normal  SparkDriverCompleted       7m4s   spark-application-controller  Driver spark-pi-driver completed
```
Get the driver logs:
```shell
kubectl logs spark-pi-driver
```
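The SparkPi example writes its result to the driver log, so a simple filter is a quick way to confirm the run succeeded:

```shell
# SparkPi prints a line such as "Pi is roughly 3.14..." on success
kubectl logs spark-pi-driver | grep "Pi is roughly"
```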
For more information on the Kubeflow Spark operator, see https://github.com/kubeflow/spark-operator.
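Finally, when you are done with the example, deleting the SparkApplication also cleans up the pods it created:

```shell
# Remove the example application and its driver/executor pods
kubectl delete sparkapplication spark-pi
```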