---
layout: global
displayTitle: Spark Configuration
title: Configuration
---

* This will become a table of contents (this text will be scraped).
{:toc}

Spark provides three locations to configure the system:

  • Spark properties control most application parameters and can be set by using a SparkConf object, or through Java system properties.
  • Environment variables can be used to set per-machine settings, such as the IP address, through the conf/spark-env.sh script on each node.
  • Logging can be configured through log4j.properties.

Spark Properties

Spark properties control most application settings and are configured separately for each application. These properties can be set directly on a SparkConf passed to your SparkContext. SparkConf allows you to configure some of the common properties (e.g. master URL and application name), as well as arbitrary key-value pairs through the set() method. For example, we could initialize an application with two threads as follows:

Note that we run with local[2], meaning two threads, which represents “minimal” parallelism and can help detect bugs that only exist when we run in a distributed context.

{% highlight scala %}
val conf = new SparkConf()
  .setMaster("local[2]")
  .setAppName("CountingSheep")
val sc = new SparkContext(conf)
{% endhighlight %}

Note that we can have more than 1 thread in local mode, and in cases like Spark Streaming, we may actually require more than 1 thread to prevent any sort of starvation issues.

Properties that specify some time duration should be configured with a unit of time. The following format is accepted:

25ms (milliseconds)
5s (seconds)
10m or 10min (minutes)
3h (hours)
5d (days)
1y (years)

Properties that specify a byte size should be configured with a unit of size.
The following format is accepted:

1b (bytes)
1k or 1kb (kibibytes = 1024 bytes)
1m or 1mb (mebibytes = 1024 kibibytes)
1g or 1gb (gibibytes = 1024 mebibytes)
1t or 1tb (tebibytes = 1024 gibibytes)
1p or 1pb (pebibytes = 1024 tebibytes)
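
For example, both formats can be used when setting properties programmatically on a SparkConf. The following is a minimal sketch; the application name and property values are illustrative, not recommendations:

{% highlight scala %}
import org.apache.spark.SparkConf

val conf = new SparkConf()
  .setAppName("UnitsExample")
  // A duration property, written with an explicit time unit (120 seconds).
  .set("spark.network.timeout", "120s")
  // A byte-size property, written with an explicit size unit (4 gibibytes).
  .set("spark.executor.memory", "4g")
{% endhighlight %}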

Dynamically Loading Spark Properties

In some cases, you may want to avoid hard-coding certain configurations in a SparkConf; for instance, you might want to run the same application with different masters or different amounts of memory. Spark allows you to simply create an empty conf:

{% highlight scala %}
val sc = new SparkContext(new SparkConf())
{% endhighlight %}

Then, you can supply configuration values at runtime:

{% highlight bash %}
./bin/spark-submit --name "My app" --master local[4] \
  --conf spark.eventLog.enabled=false \
  --conf "spark.executor.extraJavaOptions=-XX:+PrintGCDetails -XX:+PrintGCTimeStamps" \
  myApp.jar
{% endhighlight %}

The Spark shell and spark-submit tool support two ways to load configurations dynamically. The first is command line options, such as --master, as shown above. spark-submit can accept any Spark property using the --conf flag, but uses special flags for properties that play a part in launching the Spark application. Running ./bin/spark-submit --help will show the entire list of these options.

bin/spark-submit will also read configuration options from conf/spark-defaults.conf, in which each line consists of a key and a value separated by whitespace. For example:

spark.master            spark://5.6.7.8:7077
spark.executor.memory   4g
spark.eventLog.enabled  true
spark.serializer        org.apache.spark.serializer.KryoSerializer

Any values specified as flags or in the properties file will be passed on to the application and merged with those specified through SparkConf. Properties set directly on the SparkConf take highest precedence, then flags passed to spark-submit or spark-shell, then options in the spark-defaults.conf file. A few configuration keys have been renamed since earlier versions of Spark; in such cases, the older key names are still accepted, but take lower precedence than any instance of the newer key.
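
As an illustrative sketch of this precedence, suppose spark-defaults.conf specifies spark.executor.memory as in the example above; a value set directly on the SparkConf wins, and the effective value can be read back from the running context (the application name and values here are made up):

{% highlight scala %}
import org.apache.spark.{SparkConf, SparkContext}

// Even if spark-defaults.conf (or a --conf flag) specifies spark.executor.memory,
// the value set directly on the SparkConf takes the highest precedence.
val conf = new SparkConf()
  .setMaster("local[2]")
  .setAppName("PrecedenceExample")
  .set("spark.executor.memory", "2g")
val sc = new SparkContext(conf)

// Reads back the effective value; prints "2g" regardless of what the file said.
println(sc.getConf.get("spark.executor.memory"))
{% endhighlight %}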

Viewing Spark Properties

The application web UI at http://<driver>:4040 lists Spark properties in the “Environment” tab. This is a useful place to check to make sure that your properties have been set correctly. Note that only values explicitly specified through spark-defaults.conf, SparkConf, or the command line will appear. For all other configuration properties, you can assume the default value is used.
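
The explicitly set properties can also be inspected programmatically; as a small sketch, reusing the sc from the examples above:

{% highlight scala %}
// Programmatic counterpart of the "Environment" tab: prints every property that was
// explicitly set via spark-defaults.conf, SparkConf, or the command line.
// Properties left at their defaults do not appear in this listing.
println(sc.getConf.toDebugString)
{% endhighlight %}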

Available Properties

Most of the properties that control internal settings have reasonable default values. Some of the most common options to set are:

Application Properties

Note: In client mode, spark.driver.memory must not be set through the SparkConf directly in your application, because the driver JVM has already started at that point. Instead, set it through the --driver-memory command line option or in your default properties file.

Note: In Spark 1.0 and later, spark.local.dir is overridden by the SPARK_LOCAL_DIRS (Standalone, Mesos) or LOCAL_DIRS (YARN) environment variables set by the cluster manager.

Apart from these, the following properties are also available, and may be useful in some situations:

Runtime Environment

Note: In client mode, spark.driver.extraClassPath, spark.driver.extraJavaOptions, and spark.driver.extraLibraryPath must not be set through the SparkConf directly in your application, because the driver JVM has already started at that point. Instead, set them through the --driver-class-path, --driver-java-options, and --driver-library-path command line options, respectively, or in your default properties file.

Note: spark.driver.userClassPathFirst is used in cluster mode only.

Note: When Python profiling is enabled (spark.python.profile), the pyspark.profiler.BasicProfiler will be used by default, but this can be overridden by passing a profiler class in as a parameter to the SparkContext constructor.

Shuffle Behavior

Spark UI

Compression and Serialization

Memory Management

Execution Behavior

In standalone mode, setting spark.executor.cores allows an application to run multiple executors on
the same worker, provided that there are enough cores on that worker. Otherwise, only one
executor per application will run on each worker.
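
For example, the following sketch (with illustrative values) limits each executor of the application to 2 cores while capping the application at 8 cores overall, so a single 8-core standalone worker could host up to four of its executors:

{% highlight scala %}
import org.apache.spark.SparkConf

val conf = new SparkConf()
  .setAppName("MultipleExecutorsPerWorker")
  // In standalone mode, limiting each executor to 2 cores lets a worker with more
  // cores run several executors of this application side by side.
  .set("spark.executor.cores", "2")
  // Cap on the total number of cores the application may acquire across the cluster.
  .set("spark.cores.max", "8")
{% endhighlight %}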

Networking

Scheduling

Dynamic Allocation

Security

Encryption

<table class="table">
<tr><th>Property Name</th><th>Default</th><th>Meaning</th></tr>
<tr>
    <td><code>spark.ssl.enabled</code></td>
    <td>false</td>
    <td>
        <p>Whether to enable SSL connections on all supported protocols.</p>

        <p>All the SSL settings like <code>spark.ssl.xxx</code> where <code>xxx</code> is a
        particular configuration property, denote the global configuration for all the supported
        protocols. In order to override the global configuration for a particular protocol,
        the properties must be overwritten in the protocol-specific namespace.</p>

        <p>Use <code>spark.ssl.YYY.XXX</code> settings to overwrite the global configuration for
        the particular protocol denoted by <code>YYY</code>. Currently <code>YYY</code> can be
        either <code>akka</code> for Akka-based connections or <code>fs</code> for broadcast and
        file server.</p>
    </td>
</tr>
<tr>
    <td><code>spark.ssl.enabledAlgorithms</code></td>
    <td>Empty</td>
    <td>
        A comma-separated list of ciphers. The specified ciphers must be supported by the JVM.
        The reference list of protocols can be found on
        <a href="https://blogs.oracle.com/java-platform-group/entry/diagnosing_tls_ssl_and_https">this</a>
        page.
    </td>
</tr>
<tr>
    <td><code>spark.ssl.keyPassword</code></td>
    <td>None</td>
    <td>
        A password to the private key in the key-store.
    </td>
</tr>
<tr>
    <td><code>spark.ssl.keyStore</code></td>
    <td>None</td>
    <td>
        A path to a key-store file. The path can be absolute or relative to the directory in
        which the component is started.
    </td>
</tr>
<tr>
    <td><code>spark.ssl.keyStorePassword</code></td>
    <td>None</td>
    <td>
        A password to the key-store.
    </td>
</tr>
<tr>
    <td><code>spark.ssl.protocol</code></td>
    <td>None</td>
    <td>
        A protocol name. The protocol must be supported by the JVM. The reference list of
        protocols can be found on
        <a href="https://blogs.oracle.com/java-platform-group/entry/diagnosing_tls_ssl_and_https">this</a>
        page.
    </td>
</tr>
<tr>
    <td><code>spark.ssl.trustStore</code></td>
    <td>None</td>
    <td>
        A path to a trust-store file. The path can be absolute or relative to the directory in
        which the component is started.
    </td>
</tr>
<tr>
    <td><code>spark.ssl.trustStorePassword</code></td>
    <td>None</td>
    <td>
        A password to the trust-store.
    </td>
</tr>
</table>
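
As a sketch of the global versus protocol-specific namespaces described above (the paths, passwords, and cipher names below are placeholders, and in practice these options are more commonly placed in conf/spark-defaults.conf than set in application code):

{% highlight scala %}
import org.apache.spark.SparkConf

val conf = new SparkConf()
  // Global SSL settings, applied to all supported protocols.
  .set("spark.ssl.enabled", "true")
  .set("spark.ssl.keyStore", "/path/to/keystore")          // placeholder path
  .set("spark.ssl.keyStorePassword", "keystore-password")  // placeholder password
  .set("spark.ssl.enabledAlgorithms", "TLS_RSA_WITH_AES_128_CBC_SHA")
  // Protocol-specific override: only the broadcast / file server protocol ("fs")
  // uses this cipher list; other protocols keep the global setting above.
  .set("spark.ssl.fs.enabledAlgorithms", "TLS_RSA_WITH_AES_256_CBC_SHA")
{% endhighlight %}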

Spark Streaming

SparkR

Cluster Managers

Each cluster manager in Spark has additional configuration options. Configurations can be found on the pages for each mode:

  • YARN
  • Mesos
  • Standalone Mode

Environment Variables

Certain Spark settings can be configured through environment variables, which are read from the conf/spark-env.sh script in the directory where Spark is installed (or conf/spark-env.cmd on Windows). In Standalone and Mesos modes, this file can give machine specific information such as hostnames. It is also sourced when running local Spark applications or submission scripts.

Note that conf/spark-env.sh does not exist by default when Spark is installed. However, you can copy conf/spark-env.sh.template to create it. Make sure you make the copy executable.

The following variables can be set in spark-env.sh:

In addition to the above, there are also options for setting up the Spark standalone cluster scripts, such as number of cores to use on each machine and maximum memory.

Since spark-env.sh is a shell script, some of these can be set programmatically -- for example, you might compute SPARK_LOCAL_IP by looking up the IP of a specific network interface.

Configuring Logging

Spark uses log4j for logging. You can configure it by adding a log4j.properties file in the conf directory. One way to start is to copy the existing log4j.properties.template located there.

Overriding configuration directory

To specify a configuration directory other than the default “SPARK_HOME/conf”, you can set SPARK_CONF_DIR. Spark will use the configuration files (spark-defaults.conf, spark-env.sh, log4j.properties, etc.) from this directory.

Inheriting Hadoop Cluster Configuration

If you plan to read and write from HDFS using Spark, there are two Hadoop configuration files that should be included on Spark's classpath:

  • hdfs-site.xml, which provides default behaviors for the HDFS client.
  • core-site.xml, which sets the default filesystem name.

The location of these configuration files varies across CDH and HDP versions, but a common location is inside of /etc/hadoop/conf. Some tools, such as Cloudera Manager, create configurations on-the-fly, but offer a mechanism to download copies of them.

To make these files visible to Spark, set HADOOP_CONF_DIR in $SPARK_HOME/conf/spark-env.sh to a location containing the configuration files.