layout: global title: Running Spark on Kubernetes license: | Licensed to the Apache Software Foundation (ASF) under one or more contributor license agreements. See the NOTICE file distributed with this work for additional information regarding copyright ownership. The ASF licenses this file to You under the Apache License, Version 2.0 (the “License”); you may not use this file except in compliance with the License. You may obtain a copy of the License at

 http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an “AS IS” BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.

This will become a table of contents (this text will be scraped). {:toc}

Spark can run on clusters managed by Kubernetes. This feature makes use of native Kubernetes scheduler that has been added to Spark.

Security

Security features like authentication are not enabled by default. When deploying a cluster that is open to the internet or an untrusted network, it's important to secure access to the cluster to prevent unauthorized applications from running on the cluster. Please see Spark Security and the specific security sections in this doc before running Spark.

User Identity

Images built from the project provided Dockerfiles contain a default USER directive with a default UID of 185. This means that the resulting images will be running the Spark processes as this UID inside the container. Security conscious deployments should consider providing custom images with USER directives specifying their desired unprivileged UID and GID. The resulting UID should include the root group in its supplementary groups in order to be able to run the Spark executables. Users building their own images with the provided docker-image-tool.sh script can use the -u <uid> option to specify the desired UID.

Alternatively the Pod Template feature can be used to add a Security Context with a runAsUser to the pods that Spark submits. This can be used to override the USER directives in the images themselves. Please bear in mind that this requires cooperation from your users and as such may not be a suitable solution for shared environments. Cluster administrators should use Pod Security Policies if they wish to limit the users that pods may run as.

Volume Mounts

As described later in this document under Using Kubernetes Volumes Spark on K8S provides configuration options that allow for mounting certain volume types into the driver and executor pods. In particular it allows for hostPath volumes which as described in the Kubernetes documentation have known security vulnerabilities.

Cluster administrators should use Pod Security Policies to limit the ability to mount hostPath volumes appropriately for their environments.

Prerequisites

A running Kubernetes cluster at version >= 1.27 with access configured to it using kubectl. If you do not already have a working Kubernetes cluster, you may set up a test cluster on your local machine using minikube.
- We recommend using the latest release of minikube with the DNS addon enabled.
- Be aware that the default minikube configuration is not enough for running Spark applications. We recommend 3 CPUs and 4g of memory to be able to start a simple Spark application with a single executor.
- Check kubernetes-client library‘s version of your Spark environment, and its compatibility with your Kubernetes cluster’s version.
You must have appropriate permissions to list, create, edit and delete pods in your cluster. You can verify that you can list these resources by running kubectl auth can-i <list|create|edit|delete> pods.
- The service account credentials used by the driver pods must be allowed to create pods, services and configmaps.
You must have Kubernetes DNS configured in your cluster.

How it works

spark-submit can be directly used to submit a Spark application to a Kubernetes cluster. The submission mechanism works as follows:

Spark creates a Spark driver running within a Kubernetes pod.
The driver creates executors which are also running within Kubernetes pods and connects to them, and executes application code.
When the application completes, the executor pods terminate and are cleaned up, but the driver pod persists logs and remains in “completed” state in the Kubernetes API until it's eventually garbage collected or manually cleaned up.

Note that in the completed state, the driver pod does not use any computational or memory resources.

The driver and executor pod scheduling is handled by Kubernetes. Communication to the Kubernetes API is done via fabric8. It is possible to schedule the driver and executor pods on a subset of available nodes through a node selector using the configuration property for it. It will be possible to use more advanced scheduling hints like node/pod affinities in a future release.

Submitting Applications to Kubernetes

Docker Images

Kubernetes requires users to supply images that can be deployed into containers within pods. The images are built to be run in a container runtime environment that Kubernetes supports. Docker is a container runtime environment that is frequently used with Kubernetes. Spark (starting with version 2.3) ships with a Dockerfile that can be used for this purpose, or customized to match an individual application's needs. It can be found in the kubernetes/dockerfiles/ directory.

Spark also ships with a bin/docker-image-tool.sh script that can be used to build and publish the Docker images to use with the Kubernetes backend.

Example usage is:

$ ./bin/docker-image-tool.sh -r <repo> -t my-tag build
$ ./bin/docker-image-tool.sh -r <repo> -t my-tag push

This will build using the projects provided default Dockerfiles. To see more options available for customising the behaviour of this tool, including providing custom Dockerfiles, please run with the -h flag.

By default bin/docker-image-tool.sh builds docker image for running JVM jobs. You need to opt-in to build additional language binding docker images.

Example usage is

# To build additional PySpark docker image
$ ./bin/docker-image-tool.sh -r <repo> -t my-tag -p ./kubernetes/dockerfiles/spark/bindings/python/Dockerfile build

# To build additional SparkR docker image
$ ./bin/docker-image-tool.sh -r <repo> -t my-tag -R ./kubernetes/dockerfiles/spark/bindings/R/Dockerfile build

You can also use the Apache Spark Docker images (such as apache/spark:<version>) directly.

Cluster Mode

To launch Spark Pi in cluster mode,

$ ./bin/spark-submit \
    --master k8s://https://<k8s-apiserver-host>:<k8s-apiserver-port> \
    --deploy-mode cluster \
    --name spark-pi \
    --class org.apache.spark.examples.SparkPi \
    --conf spark.executor.instances=5 \
    --conf spark.kubernetes.container.image=<spark-image> \
    local:///path/to/examples.jar

The Spark master, specified either via passing the --master command line argument to spark-submit or by setting spark.master in the application‘s configuration, must be a URL with the format k8s://<api_server_host>:<k8s-apiserver-port>. The port must always be specified, even if it’s the HTTPS port 443. Prefixing the master string with k8s:// will cause the Spark application to launch on the Kubernetes cluster, with the API server being contacted at api_server_url. If no HTTP protocol is specified in the URL, it defaults to https. For example, setting the master to k8s://example.com:443 is equivalent to setting it to k8s://https://example.com:443, but to connect without TLS on a different port, the master would be set to k8s://http://example.com:8080.

In Kubernetes mode, the Spark application name that is specified by spark.app.name or the --name argument to spark-submit is used by default to name the Kubernetes resources created like drivers and executors. So, application names must consist of lower case alphanumeric characters, -, and . and must start and end with an alphanumeric character.

If you have a Kubernetes cluster setup, one way to discover the apiserver URL is by executing kubectl cluster-info.

$ kubectl cluster-info
Kubernetes master is running at http://127.0.0.1:6443

In the above example, the specific Kubernetes cluster can be used with spark-submit by specifying --master k8s://http://127.0.0.1:6443 as an argument to spark-submit. Additionally, it is also possible to use the authenticating proxy, kubectl proxy to communicate to the Kubernetes API.

The local proxy can be started by:

$ kubectl proxy

If the local proxy is running at localhost:8001, --master k8s://http://127.0.0.1:8001 can be used as the argument to spark-submit. Finally, notice that in the above example we specify a jar with a specific URI with a scheme of local://. This URI is the location of the example jar that is already in the Docker image.

Client Mode

Starting with Spark 2.4.0, it is possible to run Spark applications on Kubernetes in client mode. When your application runs in client mode, the driver can run inside a pod or on a physical host. When running an application in client mode, it is recommended to account for the following factors:

Client Mode Networking

Spark executors must be able to connect to the Spark driver over a hostname and a port that is routable from the Spark executors. The specific network configuration that will be required for Spark to work in client mode will vary per setup. If you run your driver inside a Kubernetes pod, you can use a headless service to allow your driver pod to be routable from the executors by a stable hostname. When deploying your headless service, ensure that the service‘s label selector will only match the driver pod and no other pods; it is recommended to assign your driver pod a sufficiently unique label and to use that label in the label selector of the headless service. Specify the driver’s hostname via spark.driver.host and your spark driver's port to spark.driver.port.

Client Mode Executor Pod Garbage Collection

If you run your Spark driver in a pod, it is highly recommended to set spark.kubernetes.driver.pod.name to the name of that pod. When this property is set, the Spark scheduler will deploy the executor pods with an OwnerReference, which in turn will ensure that once the driver pod is deleted from the cluster, all of the application‘s executor pods will also be deleted. The driver will look for a pod with the given name in the namespace specified by spark.kubernetes.namespace, and an OwnerReference pointing to that pod will be added to each executor pod’s OwnerReferences list. Be careful to avoid setting the OwnerReference to a pod that is not actually that driver pod, or else the executors may be terminated prematurely when the wrong pod is deleted.

If your application is not running inside a pod, or if spark.kubernetes.driver.pod.name is not set when your application is actually running in a pod, keep in mind that the executor pods may not be properly deleted from the cluster when the application exits. The Spark scheduler attempts to delete these pods, but if the network request to the API server fails for any reason, these pods will remain in the cluster. The executor processes should exit when they cannot reach the driver, so the executor pods should not consume compute resources (cpu and memory) in the cluster after your application exits.

You may use spark.kubernetes.executor.podNamePrefix to fully control the executor pod names. When this property is set, it's highly recommended to make it unique across all jobs in the same namespace.

Authentication Parameters

Use the exact prefix spark.kubernetes.authenticate for Kubernetes authentication parameters in client mode.

IPv4 and IPv6

Starting with 3.4.0, Spark supports additionally IPv6-only environment via IPv4/IPv6 dual-stack network feature which enables the allocation of both IPv4 and IPv6 addresses to Pods and Services. According to the K8s cluster capability, spark.kubernetes.driver.service.ipFamilyPolicy and spark.kubernetes.driver.service.ipFamilies can be one of SingleStack, PreferDualStack, and RequireDualStack and one of IPv4, IPv6, IPv4,IPv6, and IPv6,IPv4 respectively. By default, Spark uses spark.kubernetes.driver.service.ipFamilyPolicy=SingleStack and spark.kubernetes.driver.service.ipFamilies=IPv4.

To use only IPv6, you can submit your jobs with the following.

...
    --conf spark.kubernetes.driver.service.ipFamilies=IPv6 \

In DualStack environment, you may need java.net.preferIPv6Addresses=true for JVM and SPARK_PREFER_IPV6=true for Python additionally to use IPv6.

Dependency Management

If your application‘s dependencies are all hosted in remote locations like HDFS or HTTP servers, they may be referred to by their appropriate remote URIs. Also, application dependencies can be pre-mounted into custom-built Docker images. Those dependencies can be added to the classpath by referencing them with local:// URIs and/or setting the SPARK_EXTRA_CLASSPATH environment variable in your Dockerfiles. The local:// scheme is also required when referring to dependencies in custom-built Docker images in spark-submit. We support dependencies from the submission client’s local file system using the file:// scheme or without a scheme (using a full path), where the destination should be a Hadoop compatible filesystem. A typical example of this using S3 is via passing the following options:

...
--packages org.apache.hadoop:hadoop-aws:3.4.0
--conf spark.kubernetes.file.upload.path=s3a://<s3-bucket>/path
--conf spark.hadoop.fs.s3a.access.key=...
--conf spark.hadoop.fs.s3a.impl=org.apache.hadoop.fs.s3a.S3AFileSystem
--conf spark.hadoop.fs.s3a.fast.upload=true
--conf spark.hadoop.fs.s3a.secret.key=....
--conf spark.driver.extraJavaOptions=-Divy.cache.dir=/tmp -Divy.home=/tmp
file:///full/path/to/app.jar

The app jar file will be uploaded to the S3 and then when the driver is launched it will be downloaded to the driver pod and will be added to its classpath. Spark will generate a subdir under the upload path with a random name to avoid conflicts with spark apps running in parallel. User could manage the subdirs created according to his needs.

The client scheme is supported for the application jar, and dependencies specified by properties spark.jars, spark.files and spark.archives.

Important: all client-side dependencies will be uploaded to the given path with a flat directory structure so file names must be unique otherwise files will be overwritten. Also make sure in the derived k8s image default ivy dir has the required access rights or modify the settings as above. The latter is also important if you use --packages in cluster mode.

Secret Management

Kubernetes Secrets can be used to provide credentials for a Spark application to access secured services. To mount a user-specified secret into the driver container, users can use the configuration property of the form spark.kubernetes.driver.secrets.[SecretName]=<mount path>. Similarly, the configuration property of the form spark.kubernetes.executor.secrets.[SecretName]=<mount path> can be used to mount a user-specified secret into the executor containers. Note that it is assumed that the secret to be mounted is in the same namespace as that of the driver and executor pods. For example, to mount a secret named spark-secret onto the path /etc/secrets in both the driver and executor containers, add the following options to the spark-submit command:

--conf spark.kubernetes.driver.secrets.spark-secret=/etc/secrets
--conf spark.kubernetes.executor.secrets.spark-secret=/etc/secrets

To use a secret through an environment variable use the following options to the spark-submit command:

--conf spark.kubernetes.driver.secretKeyRef.ENV_NAME=name:key
--conf spark.kubernetes.executor.secretKeyRef.ENV_NAME=name:key

Pod Template

Kubernetes allows defining pods from template files. Spark users can similarly use template files to define the driver or executor pod configurations that Spark configurations do not support. To do so, specify the spark properties spark.kubernetes.driver.podTemplateFile and spark.kubernetes.executor.podTemplateFile to point to files accessible to the spark-submit process.

--conf spark.kubernetes.driver.podTemplateFile=s3a://bucket/driver.yml
--conf spark.kubernetes.executor.podTemplateFile=s3a://bucket/executor.yml

To allow the driver pod access the executor pod template file, the file will be automatically mounted onto a volume in the driver pod when it's created. Spark does not do any validation after unmarshalling these template files and relies on the Kubernetes API server for validation.

It is important to note that Spark is opinionated about certain pod configurations so there are values in the pod template that will always be overwritten by Spark. Therefore, users of this feature should note that specifying the pod template file only lets Spark start with a template pod instead of an empty pod during the pod-building process. For details, see the full list of pod template values that will be overwritten by spark.

Pod template files can also define multiple containers. In such cases, you can use the spark properties spark.kubernetes.driver.podTemplateContainerName and spark.kubernetes.executor.podTemplateContainerName to indicate which container should be used as a basis for the driver or executor. If not specified, or if the container name is not valid, Spark will assume that the first container in the list will be the driver or executor container.

Using Kubernetes Volumes

Users can mount the following types of Kubernetes volumes into the driver and executor pods:

hostPath: mounts a file or directory from the host node’s filesystem into a pod.
emptyDir: an initially empty volume created when a pod is assigned to a node.
nfs: mounts an existing NFS(Network File System) into a pod.
persistentVolumeClaim: mounts a PersistentVolume into a pod.

NB: Please see the Security section of this document for security issues related to volume mounts.

To mount a volume of any of the types above into the driver pod, use the following configuration property:

--conf spark.kubernetes.driver.volumes.[VolumeType].[VolumeName].mount.path=<mount path>
--conf spark.kubernetes.driver.volumes.[VolumeType].[VolumeName].mount.readOnly=<true|false>
--conf spark.kubernetes.driver.volumes.[VolumeType].[VolumeName].mount.subPath=<mount subPath>

Specifically, VolumeType can be one of the following values: hostPath, emptyDir, nfs and persistentVolumeClaim. VolumeName is the name you want to use for the volume under the volumes field in the pod specification.

Each supported type of volumes may have some specific configuration options, which can be specified using configuration properties of the following form:

spark.kubernetes.driver.volumes.[VolumeType].[VolumeName].options.[OptionName]=<value>

For example, the server and path of a nfs with volume name images can be specified using the following properties:

spark.kubernetes.driver.volumes.nfs.images.options.server=example.com
spark.kubernetes.driver.volumes.nfs.images.options.path=/data

And, the claim name of a persistentVolumeClaim with volume name checkpointpvc can be specified using the following property:

spark.kubernetes.driver.volumes.persistentVolumeClaim.checkpointpvc.options.claimName=check-point-pvc-claim

The configuration properties for mounting volumes into the executor pods use prefix spark.kubernetes.executor. instead of spark.kubernetes.driver..

For example, you can mount a dynamically-created persistent volume claim per executor by using OnDemand as a claim name and storageClass and sizeLimit options like the following. This is useful in case of Dynamic Allocation.

spark.kubernetes.executor.volumes.persistentVolumeClaim.data.options.claimName=OnDemand
spark.kubernetes.executor.volumes.persistentVolumeClaim.data.options.storageClass=gp
spark.kubernetes.executor.volumes.persistentVolumeClaim.data.options.sizeLimit=500Gi
spark.kubernetes.executor.volumes.persistentVolumeClaim.data.mount.path=/data
spark.kubernetes.executor.volumes.persistentVolumeClaim.data.mount.readOnly=false

For a complete list of available options for each supported type of volumes, please refer to the Spark Properties section below.

PVC-oriented executor pod allocation

Since disks are one of the important resource types, Spark driver provides a fine-grained control via a set of configurations. For example, by default, on-demand PVCs are owned by executors and the lifecycle of PVCs are tightly coupled with its owner executors. However, on-demand PVCs can be owned by driver and reused by another executors during the Spark job's lifetime with the following options. This reduces the overhead of PVC creation and deletion.

spark.kubernetes.driver.ownPersistentVolumeClaim=true
spark.kubernetes.driver.reusePersistentVolumeClaim=true

In addition, since Spark 3.4, Spark driver is able to do PVC-oriented executor allocation which means Spark counts the total number of created PVCs which the job can have, and holds on a new executor creation if the driver owns the maximum number of PVCs. This helps the transition of the existing PVC from one executor to another executor.

spark.kubernetes.driver.waitToReusePersistentVolumeClaim=true

Local Storage

Spark supports using volumes to spill data during shuffles and other operations. To use a volume as local storage, the volume's name should starts with spark-local-dir-, for example:

--conf spark.kubernetes.driver.volumes.[VolumeType].spark-local-dir-[VolumeName].mount.path=<mount path>
--conf spark.kubernetes.driver.volumes.[VolumeType].spark-local-dir-[VolumeName].mount.readOnly=false

Specifically, you can use persistent volume claims if the jobs require large shuffle and sorting operations in executors.

spark.kubernetes.executor.volumes.persistentVolumeClaim.spark-local-dir-1.options.claimName=OnDemand
spark.kubernetes.executor.volumes.persistentVolumeClaim.spark-local-dir-1.options.storageClass=gp
spark.kubernetes.executor.volumes.persistentVolumeClaim.spark-local-dir-1.options.sizeLimit=500Gi
spark.kubernetes.executor.volumes.persistentVolumeClaim.spark-local-dir-1.mount.path=/data
spark.kubernetes.executor.volumes.persistentVolumeClaim.spark-local-dir-1.mount.readOnly=false

To enable shuffle data recovery feature via the built-in KubernetesLocalDiskShuffleDataIO plugin, we need to have the followings. You may want to enable spark.kubernetes.driver.waitToReusePersistentVolumeClaim additionally.

spark.kubernetes.executor.volumes.persistentVolumeClaim.spark-local-dir-1.mount.path=/data/spark-x/executor-x
spark.shuffle.sort.io.plugin.class=org.apache.spark.shuffle.KubernetesLocalDiskShuffleDataIO

If no volume is set as local storage, Spark uses temporary scratch space to spill data to disk during shuffles and other operations. When using Kubernetes as the resource manager the pods will be created with an emptyDir volume mounted for each directory listed in spark.local.dir or the environment variable SPARK_LOCAL_DIRS . If no directories are explicitly specified then a default directory is created and configured appropriately.

emptyDir volumes use the ephemeral storage feature of Kubernetes and do not persist beyond the life of the pod.

Using RAM for local storage

emptyDir volumes use the nodes backing storage for ephemeral storage by default, this behaviour may not be appropriate for some compute environments. For example if you have diskless nodes with remote storage mounted over a network, having lots of executors doing IO to this remote storage may actually degrade performance.

In this case it may be desirable to set spark.kubernetes.local.dirs.tmpfs=true in your configuration which will cause the emptyDir volumes to be configured as tmpfs i.e. RAM backed volumes. When configured like this Spark's local storage usage will count towards your pods memory usage therefore you may wish to increase your memory requests by increasing the value of spark.{driver,executor}.memoryOverheadFactor as appropriate.

Introspection and Debugging

These are the different ways in which you can investigate a running/completed Spark application, monitor progress, and take actions.

Accessing Logs

Logs can be accessed using the Kubernetes API and the kubectl CLI. When a Spark application is running, it's possible to stream logs from the application using:

$ kubectl -n=<namespace> logs -f <driver-pod-name>

The same logs can also be accessed through the Kubernetes dashboard if installed on the cluster.

Accessing Driver UI

The UI associated with any application can be accessed locally using kubectl port-forward.

$ kubectl port-forward <driver-pod-name> 4040:4040

Then, the Spark driver UI can be accessed on http://localhost:4040.

Since Apache Spark 4.0.0, Driver UI provides a way to see driver logs via a new configuration.

spark.driver.log.localDir=/tmp

Then, the Spark driver UI can be accessed on http://localhost:4040/logs/. Optionally, the layout of log is configured by the following.

spark.driver.log.layout="%m%n%ex"

Debugging

There may be several kinds of failures. If the Kubernetes API server rejects the request made from spark-submit, or the connection is refused for a different reason, the submission logic should indicate the error encountered. However, if there are errors during the running of the application, often, the best way to investigate may be through the Kubernetes CLI.

To get some basic information about the scheduling decisions made around the driver pod, you can run:

$ kubectl describe pod <spark-driver-pod>

If the pod has encountered a runtime error, the status can be probed further using:

$ kubectl logs <spark-driver-pod>

Status and logs of failed executor pods can be checked in similar ways. Finally, deleting the driver pod will clean up the entire spark application, including all executors, associated service, etc. The driver pod can be thought of as the Kubernetes representation of the Spark application.

Kubernetes Features

Configuration File

Your Kubernetes config file typically lives under .kube/config in your home directory or in a location specified by the KUBECONFIG environment variable. Spark on Kubernetes will attempt to use this file to do an initial auto-configuration of the Kubernetes client used to interact with the Kubernetes cluster. A variety of Spark configuration properties are provided that allow further customising the client configuration e.g. using an alternative authentication method.

Contexts

Kubernetes configuration files can contain multiple contexts that allow for switching between different clusters and/or user identities. By default Spark on Kubernetes will use your current context (which can be checked by running kubectl config current-context) when doing the initial auto-configuration of the Kubernetes client.

In order to use an alternative context users can specify the desired context via the Spark configuration property spark.kubernetes.context e.g. spark.kubernetes.context=minikube.

Namespaces

Kubernetes has the concept of namespaces. Namespaces are ways to divide cluster resources between multiple users (via resource quota). Spark on Kubernetes can use namespaces to launch Spark applications. This can be made use of through the spark.kubernetes.namespace configuration.

Kubernetes allows using ResourceQuota to set limits on resources, number of objects, etc on individual namespaces. Namespaces and ResourceQuota can be used in combination by administrator to control sharing and resource allocation in a Kubernetes cluster running Spark applications.

RBAC

In Kubernetes clusters with RBAC enabled, users can configure Kubernetes RBAC roles and service accounts used by the various Spark on Kubernetes components to access the Kubernetes API server.

The Spark driver pod uses a Kubernetes service account to access the Kubernetes API server to create and watch executor pods. The service account used by the driver pod must have the appropriate permission for the driver to be able to do its work. Specifically, at minimum, the service account must be granted a Role or ClusterRole that allows driver pods to create pods and services. By default, the driver pod is automatically assigned the default service account in the namespace specified by spark.kubernetes.namespace, if no service account is specified when the pod gets created.

Depending on the version and setup of Kubernetes deployed, this default service account may or may not have the role that allows driver pods to create pods and services under the default Kubernetes RBAC policies. Sometimes users may need to specify a custom service account that has the right role granted. Spark on Kubernetes supports specifying a custom service account to be used by the driver pod through the configuration property spark.kubernetes.authenticate.driver.serviceAccountName=<service account name>. For example, to make the driver pod use the spark service account, a user simply adds the following option to the spark-submit command:

--conf spark.kubernetes.authenticate.driver.serviceAccountName=spark

To create a custom service account, a user can use the kubectl create serviceaccount command. For example, the following command creates a service account named spark:

$ kubectl create serviceaccount spark

To grant a service account a Role or ClusterRole, a RoleBinding or ClusterRoleBinding is needed. To create a RoleBinding or ClusterRoleBinding, a user can use the kubectl create rolebinding (or clusterrolebinding for ClusterRoleBinding) command. For example, the following command creates an edit ClusterRole in the default namespace and grants it to the spark service account created above:

$ kubectl create clusterrolebinding spark-role --clusterrole=edit --serviceaccount=default:spark --namespace=default

Note that a Role can only be used to grant access to resources (like pods) within a single namespace, whereas a ClusterRole can be used to grant access to cluster-scoped resources (like nodes) as well as namespaced resources (like pods) across all namespaces. For Spark on Kubernetes, since the driver always creates executor pods in the same namespace, a Role is sufficient, although users may use a ClusterRole instead. For more information on RBAC authorization and how to configure Kubernetes service accounts for pods, please refer to Using RBAC Authorization and Configure Service Accounts for Pods.

Spark Application Management

Kubernetes provides simple application management via the spark-submit CLI tool in cluster mode. Users can kill a job by providing the submission ID that is printed when submitting their job. The submission ID follows the format namespace:driver-pod-name. If user omits the namespace then the namespace set in current k8s context is used. For example if user has set a specific namespace as follows kubectl config set-context minikube --namespace=spark then the spark namespace will be used by default. On the other hand, if there is no namespace added to the specific context then all namespaces will be considered by default. That means operations will affect all Spark applications matching the given submission ID regardless of namespace. Moreover, spark-submit for application management uses the same backend code that is used for submitting the driver, so the same properties like spark.kubernetes.context etc., can be re-used.

For example:

$ spark-submit --kill spark:spark-pi-1547948636094-driver --master k8s://https://192.168.2.8:8443

Users also can list the application status by using the --status flag:

$ spark-submit --status spark:spark-pi-1547948636094-driver --master  k8s://https://192.168.2.8:8443

Both operations support glob patterns. For example user can run:

$ spark-submit --kill spark:spark-pi* --master  k8s://https://192.168.2.8:8443

The above will kill all application with the specific prefix.

User can specify the grace period for pod termination via the spark.kubernetes.appKillPodDeletionGracePeriod property, using --conf as means to provide it (default value for all K8s pods is 30 secs).

Future Work

There are several Spark on Kubernetes features that are currently being worked on or planned to be worked on. Those features are expected to eventually make it into future versions of the spark-kubernetes integration.

Some of these include:

External Shuffle Service
Job Queues and Resource Management

Configuration

See the configuration page for information on Spark configurations. The following configurations are specific to Spark on Kubernetes.

Spark Properties

Pod template properties

See the below table for the full list of pod specifications that will be overwritten by spark.

Pod Metadata

Pod Spec

Container spec

The following affect the driver and executor containers. All other containers in the pod spec will be unaffected.

Resource Allocation and Configuration Overview

Please make sure to have read the Custom Resource Scheduling and Configuration Overview section on the configuration page. This section only talks about the Kubernetes specific aspects of resource scheduling.

The user is responsible to properly configuring the Kubernetes cluster to have the resources available and ideally isolate each resource per container so that a resource is not shared between multiple containers. If the resource is not isolated the user is responsible for writing a discovery script so that the resource is not shared between containers. See the Kubernetes documentation for specifics on configuring Kubernetes with custom resources.

Spark automatically handles translating the Spark configs spark.{driver/executor}.resource.{resourceType} into the kubernetes configs as long as the Kubernetes resource type follows the Kubernetes device plugin format of vendor-domain/resourcetype. The user must specify the vendor using the spark.{driver/executor}.resource.{resourceType}.vendor config. The user does not need to explicitly add anything if you are using Pod templates. For reference and an example, you can see the Kubernetes documentation for scheduling GPUs. Spark only supports setting the resource limits.

Kubernetes does not tell Spark the addresses of the resources allocated to each container. For that reason, the user must specify a discovery script that gets run by the executor on startup to discover what resources are available to that executor. You can find an example scripts in examples/src/main/scripts/getGpusResources.sh. The script must have execute permissions set and the user should setup permissions to not allow malicious users to modify it. The script should write to STDOUT a JSON string in the format of the ResourceInformation class. This has the resource name and an array of resource addresses available to just that executor.

Resource Level Scheduling Overview

There are several resource level scheduling features supported by Spark on Kubernetes.

Priority Scheduling

Kubernetes supports Pod priority by default.

Spark on Kubernetes allows defining the priority of jobs by Pod template. The user can specify the priorityClassName in driver or executor Pod template spec section. Below is an example to show how to specify it:

apiVersion: v1
Kind: Pod
metadata:
  labels:
    template-label-key: driver-template-label-value
spec:
  # Specify the priority in here
  priorityClassName: system-node-critical
  containers:
  - name: test-driver-container
    image: will-be-overwritten

Customized Kubernetes Schedulers for Spark on Kubernetes

Spark allows users to specify a custom Kubernetes schedulers.

Specify a scheduler name.
Users can specify a custom scheduler using spark.kubernetes.scheduler.name or spark.kubernetes.{driver/executor}.scheduler.name configuration.
Specify scheduler related configurations.
To configure the custom scheduler the user can use Pod templates, add labels (spark.kubernetes.{driver,executor}.label.), annotations (spark.kubernetes.{driver/executor}.annotation.) or scheduler specific configurations (such as spark.kubernetes.scheduler.volcano.podGroupTemplateFile).
Specify scheduler feature step.
Users may also consider to use spark.kubernetes.{driver/executor}.pod.featureSteps to support more complex requirements, including but not limited to:
- Create additional Kubernetes custom resources for driver/executor scheduling.
- Set scheduler hints according to configuration or existing Pod info dynamically.

Using Volcano as Customized Scheduler for Spark on Kubernetes

Prerequisites

Spark on Kubernetes with Volcano as a custom scheduler is supported since Spark v3.3.0 and Volcano v1.7.0. Below is an example to install Volcano 1.7.0:
```
kubectl apply -f https://raw.githubusercontent.com/volcano-sh/volcano/v1.7.0/installer/volcano-development.yaml
```

Build

To create a Spark distribution along with Volcano suppport like those distributed by the Spark Downloads page, also see more in “Building Spark”:

./dev/make-distribution.sh --name custom-spark --pip --r --tgz -Psparkr -Phive -Phive-thriftserver -Pkubernetes -Pvolcano

Usage

Spark on Kubernetes allows using Volcano as a custom scheduler. Users can use Volcano to support more advanced resource scheduling: queue scheduling, resource reservation, priority scheduling, and more.

To use Volcano as a custom scheduler the user needs to specify the following configuration options:

# Specify volcano scheduler and PodGroup template
--conf spark.kubernetes.scheduler.name=volcano
--conf spark.kubernetes.scheduler.volcano.podGroupTemplateFile=/path/to/podgroup-template.yaml
# Specify driver/executor VolcanoFeatureStep
--conf spark.kubernetes.driver.pod.featureSteps=org.apache.spark.deploy.k8s.features.VolcanoFeatureStep
--conf spark.kubernetes.executor.pod.featureSteps=org.apache.spark.deploy.k8s.features.VolcanoFeatureStep

Volcano Feature Step

Volcano feature steps help users to create a Volcano PodGroup and set driver/executor pod annotation to link with this PodGroup.

Note that currently only driver/job level PodGroup is supported in Volcano Feature Step.

Volcano PodGroup Template

Volcano defines PodGroup spec using CRD yaml.

Similar to Pod template, Spark users can use Volcano PodGroup Template to define the PodGroup spec configurations. To do so, specify the Spark property spark.kubernetes.scheduler.volcano.podGroupTemplateFile to point to files accessible to the spark-submit process. Below is an example of PodGroup template:

apiVersion: scheduling.volcano.sh/v1beta1
kind: PodGroup
spec:
  # Specify minMember to 1 to make a driver pod
  minMember: 1
  # Specify minResources to support resource reservation (the driver pod resource and executors pod resource should be considered)
  # It is useful for ensource the available resources meet the minimum requirements of the Spark job and avoiding the
  # situation where drivers are scheduled, and then they are unable to schedule sufficient executors to progress.
  minResources:
    cpu: "2"
    memory: "3Gi"
  # Specify the priority, help users to specify job priority in the queue during scheduling.
  priorityClassName: system-node-critical
  # Specify the queue, indicates the resource queue which the job should be submitted to
  queue: default

Using Apache YuniKorn as Customized Scheduler for Spark on Kubernetes

Apache YuniKorn is a resource scheduler for Kubernetes that provides advanced batch scheduling capabilities, such as job queuing, resource fairness, min/max queue capacity and flexible job ordering policies. For available Apache YuniKorn features, please refer to core features.

Prerequisites

Install Apache YuniKorn:

helm repo add yunikorn https://apache.github.io/yunikorn-release
helm repo update
helm install yunikorn yunikorn/yunikorn --namespace yunikorn --version 1.5.0 --create-namespace --set embedAdmissionController=false

The above steps will install YuniKorn v1.5.0 on an existing Kubernetes cluster.

Get started

Submit Spark jobs with the following extra options:

--conf spark.kubernetes.scheduler.name=yunikorn
--conf spark.kubernetes.driver.label.queue=root.default
--conf spark.kubernetes.executor.label.queue=root.default
--conf spark.kubernetes.driver.annotation.yunikorn.apache.org/app-id={% raw %}{{APP_ID}}{% endraw %}
--conf spark.kubernetes.executor.annotation.yunikorn.apache.org/app-id={% raw %}{{APP_ID}}{% endraw %}

Note that {% raw %}{{APP_ID}}{% endraw %} is the built-in variable that will be substituted with Spark job ID automatically. With the above configuration, the job will be scheduled by YuniKorn scheduler instead of the default Kubernetes scheduler.

Stage Level Scheduling Overview

Stage level scheduling is supported on Kubernetes:

When dynamic allocation is disabled: It allows users to specify different task resource requirements at the stage level and will use the same executors requested at startup.
When dynamic allocation is enabled: It allows users to specify task and executor resource requirements at the stage level and will request the extra executors. This also requires spark.dynamicAllocation.shuffleTracking.enabled to be enabled since Kubernetes doesn't support an external shuffle service at this time. The order in which containers for different profiles is requested from Kubernetes is not guaranteed. Note that since dynamic allocation on Kubernetes requires the shuffle tracking feature, this means that executors from previous stages that used a different ResourceProfile may not idle timeout due to having shuffle data on them. This could result in using more cluster resources and in the worst case if there are no remaining resources on the Kubernetes cluster then Spark could potentially hang. You may consider looking at config spark.dynamicAllocation.shuffleTracking.timeout to set a timeout, but that could result in data having to be recomputed if the shuffle data is really needed. Note, there is a difference in the way pod template resources are handled between the base default profile and custom ResourceProfiles. Any resources specified in the pod template file will only be used with the base default profile. If you create custom ResourceProfiles be sure to include all necessary resources there since the resources from the template file will not be propagated to custom ResourceProfiles.