---
layout: global
title: Spark Connect Overview
license: |
  Licensed to the Apache Software Foundation (ASF) under one or more
  contributor license agreements. See the NOTICE file distributed with
  this work for additional information regarding copyright ownership.
  The ASF licenses this file to You under the Apache License, Version 2.0
  (the "License"); you may not use this file except in compliance with
  the License. You may obtain a copy of the License at

     http://www.apache.org/licenses/LICENSE-2.0

  Unless required by applicable law or agreed to in writing, software
  distributed under the License is distributed on an "AS IS" BASIS,
  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
  See the License for the specific language governing permissions and
  limitations under the License.
---

Building client-side Spark applications

In Apache Spark 3.4, Spark Connect introduced a decoupled client-server architecture that allows remote connectivity to Spark clusters using the DataFrame API and unresolved logical plans as the protocol. The separation between client and server allows Spark and its open ecosystem to be leveraged from everywhere. It can be embedded in modern data applications, in IDEs, Notebooks and programming languages.

To get started, see Quickstart: Spark Connect.

How Spark Connect works

The Spark Connect client library is designed to simplify Spark application development. It is a thin API that can be embedded everywhere: in application servers, IDEs, notebooks, and programming languages. The Spark Connect API builds on Spark's DataFrame API using unresolved logical plans as a language-agnostic protocol between the client and the Spark driver.

The Spark Connect client translates DataFrame operations into unresolved logical query plans which are encoded using protocol buffers. These are sent to the server using the gRPC framework.

The Spark Connect endpoint embedded on the Spark Server receives and translates unresolved logical plans into Spark’s logical plan operators. This is similar to parsing a SQL query, where attributes and relations are parsed and an initial parse plan is built. From there, the standard Spark execution process kicks in, ensuring that Spark Connect leverages all of Spark’s optimizations and enhancements. Results are streamed back to the client through gRPC as Apache Arrow-encoded row batches.
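To make that round trip concrete, here is a minimal PySpark sketch, assuming a Spark Connect server is running at sc://localhost; the transformations only build a plan on the client, and the action triggers the gRPC exchange:

{% highlight python %}
from pyspark.sql import SparkSession

# Connect to a Spark Connect server (assumed to be running locally).
spark = SparkSession.builder.remote("sc://localhost").getOrCreate()

# These transformations only build an unresolved logical plan on the client;
# no work has been sent to the server yet.
df = spark.range(100).filter("id % 2 == 0").selectExpr("sum(id) AS total")

# The action serializes the plan to protocol buffers, sends it over gRPC,
# and streams the result back as Arrow-encoded batches.
df.show()
{% endhighlight %}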

How Spark Connect client applications differ from classic Spark applications

One of the main design goals of Spark Connect is to enable a full separation and isolation of the client from the server. As a consequence, there are some changes that developers need to be aware of when using Spark Connect:

  1. The client does not run in the same process as the Spark driver. This means that the client cannot directly access and interact with the driver JVM to manipulate the execution environment. In particular, in PySpark, the client does not use Py4J, so accessing the private fields that hold the JVM implementation of DataFrame, Column, SparkSession, etc. (e.g. df._jdf) is not possible.
  2. By design, the Spark Connect protocol uses Spark's logical plans as the abstraction to declaratively describe the operations to be executed on the server. Consequently, the Spark Connect protocol does not support all of Spark's execution APIs, most importantly RDDs.
  3. Spark Connect provides a session-based client for its consumers. This means that the client does not have access to properties of the cluster that manipulate the environment for all connected clients. Most importantly, the client does not have access to the static Spark configuration or the SparkContext. These differences are illustrated in the sketch after this list.
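The following PySpark sketch illustrates the points above. It assumes a Spark Connect server at sc://localhost, and the exact exception types raised for unsupported accesses may vary between Spark versions:

{% highlight python %}
from pyspark.sql import SparkSession

spark = SparkSession.builder.remote("sc://localhost").getOrCreate()
df = spark.range(10)

# Session-scoped runtime SQL configurations still work over Spark Connect.
spark.conf.set("spark.sql.shuffle.partitions", "10")

# Driver-bound APIs are not available from the client: no Py4J bridge (df._jdf),
# no RDDs (df.rdd), and no SparkContext (spark.sparkContext).
for expression in ("df._jdf", "df.rdd", "spark.sparkContext"):
    try:
        eval(expression)
    except Exception as error:  # the concrete exception class depends on the Spark version
        print(f"{expression} is not available over Spark Connect: {type(error).__name__}")
{% endhighlight %}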

Operational benefits of Spark Connect

With this new architecture, Spark Connect mitigates several multi-tenant operational issues:

Stability: Applications that use too much memory will now only impact their own environment as they can run in their own processes. Users can define their own dependencies on the client and don't need to worry about potential conflicts with the Spark driver.

Upgradability: The Spark driver can now seamlessly be upgraded independently of applications, for example to benefit from performance improvements and security fixes. This means applications can be forward-compatible, as long as the server-side RPC definitions are designed to be backwards compatible.

Debuggability and observability: Spark Connect enables interactive debugging during development directly from your favorite IDE. Similarly, applications can be monitored using the application's framework native metrics and logging libraries.

How to use Spark Connect

Starting with Spark 3.4, Spark Connect is available and supports PySpark and Scala applications. We will walk through how to run an Apache Spark server with Spark Connect and connect to it from a client application using the Spark Connect client library.

Download and start Spark server with Spark Connect

First, download Spark from the Download Apache Spark page. Spark Connect was introduced in Apache Spark version 3.4 so make sure you choose 3.4.0 or newer in the release drop down at the top of the page. Then choose your package type, typically “Pre-built for Apache Hadoop 3.3 and later”, and click the link to download.

Now extract the Spark package you just downloaded on your computer, for example:

{% highlight bash %}
tar -xvf spark-{{site.SPARK_VERSION_SHORT}}-bin-hadoop3.tgz
{% endhighlight %}

In a terminal window, go to the spark folder in the location where you extracted Spark before and run the start-connect-server.sh script to start Spark server with Spark Connect, like in this example:

{% highlight bash %}
./sbin/start-connect-server.sh --packages org.apache.spark:spark-connect_2.13:{{site.SPARK_VERSION_SHORT}}
{% endhighlight %}

Note that we include a Spark Connect package (spark-connect_2.13:{{site.SPARK_VERSION_SHORT}}) when starting the Spark server. This package is required to use Spark Connect. Make sure to use the same version of the package as the Spark version you downloaded previously. In this example, that is Spark {{site.SPARK_VERSION_SHORT}} with Scala 2.13.

Now Spark server is running and ready to accept Spark Connect sessions from client applications. In the next section we will walk through how to use Spark Connect when writing client applications.

Use Spark Connect for interactive analysis

If you do not use one of the mechanisms outlined here, your Spark session will work just like before, without leveraging Spark Connect.

Set SPARK_REMOTE environment variable

If you set the SPARK_REMOTE environment variable on the client machine where your Spark client application is running and create a new Spark Session as in the following example, the session will be a Spark Connect session. With this approach, there is no code change needed to start using Spark Connect.

In a terminal window, set the SPARK_REMOTE environment variable to point to the local Spark server you started previously on your computer:

{% highlight bash %}
export SPARK_REMOTE="sc://localhost"
{% endhighlight %}

And start the Spark shell as usual:

{% highlight bash %}
./bin/pyspark
{% endhighlight %}

The PySpark shell is now connected to Spark using Spark Connect as indicated in the welcome message:

{% highlight python %}
Client connected to the Spark Connect server at localhost
{% endhighlight %}
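Because SPARK_REMOTE is read when the session is built, the same approach works for plain Python scripts, not just the shell. A minimal sketch follows (the file name is hypothetical); run it with SPARK_REMOTE set as above and no code changes are needed:

{% highlight python %}
# unchanged_app.py (hypothetical file name)
# With SPARK_REMOTE set in the environment, this unmodified code creates a
# Spark Connect session; unset the variable and it behaves as before.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
print(type(spark))  # pyspark.sql.connect.session.SparkSession when SPARK_REMOTE is set
spark.range(5).show()
{% endhighlight %}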

Specify Spark Connect when creating Spark session

You can also specify that you want to use Spark Connect explicitly when you create a Spark session.

For example, you can launch the PySpark shell with Spark Connect as illustrated here.

To launch the PySpark shell with Spark Connect, simply include the remote parameter and specify the location of your Spark server. We are using localhost in this example to connect to the local Spark server we started previously:

{% highlight bash %}
./bin/pyspark --remote "sc://localhost"
{% endhighlight %}

And you will notice that the PySpark shell welcome message tells you that you have connected to Spark using Spark Connect:

{% highlight python %}
Client connected to the Spark Connect server at localhost
{% endhighlight %}

You can also check the Spark session type: if it includes .connect., you are using Spark Connect, as shown in this example:

{% highlight python %}
SparkSession available as 'spark'.
>>> type(spark)
<class 'pyspark.sql.connect.session.SparkSession'>
{% endhighlight %}

Now you can run PySpark code in the shell to see Spark Connect in action:

{% highlight python %}
columns = ["id","name"]
data = [(1,"Sarah"),(2,"Maria")]
df = spark.createDataFrame(data).toDF(*columns)
df.show()
+---+-----+
| id| name|
+---+-----+
|  1|Sarah|
|  2|Maria|
+---+-----+
{% endhighlight %}

To set up the new Scala shell, first download and install Coursier CLI. Then, install the REPL using the following command in a terminal window:

{% highlight bash %}
cs install --contrib spark-connect-repl
{% endhighlight %}

And now you can start the Ammonite-based Scala REPL/shell to connect to your Spark server like this:

{% highlight bash %}
spark-connect-repl
{% endhighlight %}

A greeting message will appear when the REPL successfully initializes:

{% highlight bash %}
Spark session available as 'spark'.
{% endhighlight %}

By default, the REPL will attempt to connect to a local Spark Server. Run the following Scala code in the shell to see Spark Connect in action:

{% highlight scala %}
@ spark.range(10).count
res0: Long = 10L
{% endhighlight %}

Configure client-server connection

By default, the REPL will attempt to connect to a local Spark Server on port 15002. The connection, however, may be configured in several ways as described in this configuration reference.

Set SPARK_REMOTE environment variable

The SPARK_REMOTE environment variable can be set on the client machine to customize the client-server connection that is initialized at REPL startup.

{% highlight bash %}
export SPARK_REMOTE="sc://myhost.com:443/;token=ABCDEFG"
spark-connect-repl
{% endhighlight %}

or

{% highlight bash %}
SPARK_REMOTE="sc://myhost.com:443/;token=ABCDEFG" spark-connect-repl
{% endhighlight %}

Use CLI arguments

The customizations may also be passed in through CLI arguments as shown below:

{% highlight bash %}
spark-connect-repl --host myhost.com --port 443 --token ABCDEFG
{% endhighlight %}

The supported list of CLI arguments may be found here.

Configure programmatically with a connection string

The connection may also be programmatically created using SparkSession#builder as in this example:

{% highlight scala %}
@ import org.apache.spark.sql.SparkSession
@ val spark = SparkSession.builder.remote("sc://localhost:443/;token=ABCDEFG").build()
{% endhighlight %}

Use Spark Connect in standalone applications

First, install PySpark with pip install pyspark[connect]==3.5.0 or, if building a packaged PySpark application/library, add it to your setup.py file as:

{% highlight python %}
install_requires=[
    'pyspark[connect]==3.5.0'
]
{% endhighlight %}

When writing your own code, include the remote function with a reference to your Spark server when you create a Spark session, as in this example:

{% highlight python %}
from pyspark.sql import SparkSession

spark = SparkSession.builder.remote("sc://localhost").getOrCreate()
{% endhighlight %}
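The remote connection string can also carry additional options beyond the host name, using the form sc://host:port/;param=value. A sketch follows, assuming your client version accepts the use_ssl and token parameters; the host and token are placeholders, not working values:

{% highlight python %}
from pyspark.sql import SparkSession

# Placeholders only: substitute your own host, port, and token.
spark = (
    SparkSession.builder
    .remote("sc://myhost.example.com:443/;use_ssl=true;token=ABCDEFG")
    .getOrCreate()
)
{% endhighlight %}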

For illustration purposes, we’ll create a simple Spark Connect application, SimpleApp.py:

{% highlight python %}
"""SimpleApp.py"""
from pyspark.sql import SparkSession

logFile = "YOUR_SPARK_HOME/README.md"  # Should be some file on your system
spark = SparkSession.builder.remote("sc://localhost").appName("SimpleApp").getOrCreate()
logData = spark.read.text(logFile).cache()

numAs = logData.filter(logData.value.contains('a')).count()
numBs = logData.filter(logData.value.contains('b')).count()

print("Lines with a: %i, lines with b: %i" % (numAs, numBs))

spark.stop()
{% endhighlight %}

This program just counts the number of lines containing ‘a’ and the number containing ‘b’ in a text file. Note that you’ll need to replace YOUR_SPARK_HOME with the location where Spark is installed.

We can run this application with the regular Python interpreter as follows:

{% highlight python %}
# Use the Python interpreter to run your application
$ python SimpleApp.py
...
Lines with a: 72, lines with b: 39
{% endhighlight %}

When writing your own code, include the remote function with a reference to your Spark server when you create a Spark session, as in this example:

{% highlight scala %}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().remote("sc://localhost").build()
{% endhighlight %}

Note: Operations that reference user-defined code such as UDFs, filter, map, etc., require a ClassFinder to be registered to pick up and upload any required classfiles. Also, any JAR dependencies must be uploaded to the server using SparkSession#AddArtifact.

Example:

{% highlight scala %}
import org.apache.spark.sql.connect.client.REPLClassDirMonitor

// Register a ClassFinder to monitor and upload the classfiles from the build output.
val classFinder = new REPLClassDirMonitor(<ABSOLUTE_PATH_TO_BUILD_OUTPUT_DIR>)
spark.registerClassFinder(classFinder)

// Upload JAR dependencies
spark.addArtifact(<ABSOLUTE_PATH_JAR_DEP>)
{% endhighlight %}

Here, ABSOLUTE_PATH_TO_BUILD_OUTPUT_DIR is the output directory where the build system writes classfiles and ABSOLUTE_PATH_JAR_DEP is the location of the JAR on the local file system.

The REPLClassDirMonitor is a provided implementation of ClassFinder that monitors a specific directory, but one may also implement their own class extending ClassFinder for customized search and monitoring.

Client application authentication

While Spark Connect does not have built-in authentication, it is designed to work seamlessly with your existing authentication infrastructure. Its gRPC HTTP/2 interface allows for the use of authenticating proxies, which makes it possible to secure Spark Connect without having to implement authentication logic in Spark directly.

What is supported

PySpark: In Spark 3.4, Spark Connect supports most PySpark APIs, including DataFrame, Functions, and Column. However, some APIs such as SparkContext and RDD are not supported. You can check which APIs are currently supported in the API reference documentation. Supported APIs are labeled “Supports Spark Connect” so you can check whether the APIs you are using are available before migrating existing code to Spark Connect.

Scala: In Spark 3.5, Spark Connect supports most Scala APIs, including Dataset, functions, Column, Catalog and KeyValueGroupedDataset.

User-Defined Functions (UDFs) are supported by default for the shell, and in standalone applications with additional set-up requirements.
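For example, a Python UDF defined on the client is serialized and shipped to the server for execution. A minimal PySpark sketch, assuming a Spark Connect session is already available as spark:

{% highlight python %}
from pyspark.sql.functions import udf

# The function body is pickled on the client and executed on the server.
@udf(returnType="int")
def add_one(x):
    return x + 1

spark.range(3).select(add_one("id").alias("plus_one")).show()
{% endhighlight %}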

Majority of the Streaming API is supported, including DataStreamReader, DataStreamWriter, StreamingQuery and StreamingQueryListener.

APIs such as SparkContext and RDD are deprecated in all Spark Connect versions.

Support for more APIs is planned for upcoming Spark releases.