Spark provides three locations to configure the system:

* Spark properties control most application settings, and are configured separately for each application.
* Environment variables can be used to set per-machine settings, such as the IP address, through the `conf/spark-env.sh` script on each node.
* Logging can be configured through `log4j.properties`.

Spark properties control most application settings and are configured separately for each application. These properties can be set directly on a `SparkConf` passed to your `SparkContext`. `SparkConf` allows you to configure some of the common properties (e.g. master URL and application name), as well as arbitrary key-value pairs through the `set()` method. For example, we could initialize an application with two threads as follows:
Note that we run with `local[2]`, meaning two threads, which represents "minimal" parallelism and can help detect bugs that only exist when we run in a distributed context.
{% highlight scala %}
val conf = new SparkConf()
             .setMaster("local[2]")
             .setAppName("CountingSheep")
val sc = new SparkContext(conf)
{% endhighlight %}
Note that we can have more than one thread in local mode, and in cases like Spark Streaming, we may actually require more than one thread to prevent any sort of starvation issues.
Properties that specify some time duration should be configured with a unit of time. The following format is accepted:

* `25ms` (milliseconds)
* `5s` (seconds)
* `10m` or `10min` (minutes)
* `3h` (hours)
* `5d` (days)
* `1y` (years)
Properties that specify a byte size should be configured with a unit of size. The following format is accepted:

* `1b` (bytes)
* `1k` or `1kb` (kibibytes = 1024 bytes)
* `1m` or `1mb` (mebibytes = 1024 kibibytes)
* `1g` or `1gb` (gibibytes = 1024 mebibytes)
* `1t` or `1tb` (tebibytes = 1024 gibibytes)
* `1p` or `1pb` (pebibytes = 1024 tebibytes)
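As a rough illustration of how such suffixed values resolve (a hypothetical sketch, not Spark's actual parser, which lives in its internal utility classes):

```scala
// Hypothetical sketch: resolve suffixed duration strings to milliseconds and
// suffixed size strings to bytes, following the unit tables above. Note that
// "m" means minutes in a duration but mebibytes in a size, so the two unit
// tables are kept separate.
object SuffixedValues {
  private val durationMs = Map(
    "ms" -> 1L, "s" -> 1000L, "m" -> 60000L, "min" -> 60000L,
    "h" -> 3600000L, "d" -> 86400000L, "y" -> 365L * 86400000L)

  private val sizeBytes = Map(
    "b" -> 1L, "k" -> 1024L, "kb" -> 1024L,
    "m" -> (1L << 20), "mb" -> (1L << 20),
    "g" -> (1L << 30), "gb" -> (1L << 30),
    "t" -> (1L << 40), "tb" -> (1L << 40),
    "p" -> (1L << 50), "pb" -> (1L << 50))

  private val pattern = "([0-9]+)([a-z]+)".r

  def toMillis(value: String): Long = value.trim.toLowerCase match {
    case pattern(n, unit) => n.toLong * durationMs(unit)
  }

  def toBytes(value: String): Long = value.trim.toLowerCase match {
    case pattern(n, unit) => n.toLong * sizeBytes(unit)
  }
}
```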
In some cases, you may want to avoid hard-coding certain configurations in a `SparkConf`. For instance, if you'd like to run the same application with different masters or different amounts of memory, Spark allows you to simply create an empty conf:
{% highlight scala %}
val sc = new SparkContext(new SparkConf())
{% endhighlight %}
Then, you can supply configuration values at runtime:

{% highlight bash %}
./bin/spark-submit --name "My app" --master local[4] --conf spark.eventLog.enabled=false \
  --conf "spark.executor.extraJavaOptions=-XX:+PrintGCDetails -XX:+PrintGCTimeStamps" myApp.jar
{% endhighlight %}
The Spark shell and `spark-submit` tool support two ways to load configurations dynamically. The first is command line options, such as `--master`, as shown above. `spark-submit` can accept any Spark property using the `--conf` flag, but uses special flags for properties that play a part in launching the Spark application. Running `./bin/spark-submit --help` will show the entire list of these options.
`bin/spark-submit` will also read configuration options from `conf/spark-defaults.conf`, in which each line consists of a key and a value separated by whitespace. For example:

    spark.master            spark://5.6.7.8:7077
    spark.executor.memory   4g
    spark.eventLog.enabled  true
    spark.serializer        org.apache.spark.serializer.KryoSerializer
Any values specified as flags or in the properties file will be passed on to the application and merged with those specified through SparkConf. Properties set directly on the SparkConf take highest precedence, then flags passed to `spark-submit` or `spark-shell`, then options in the `spark-defaults.conf` file. A few configuration keys have been renamed since earlier versions of Spark; in such cases, the older key names are still accepted, but take lower precedence than any instance of the newer key.
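These precedence rules can be sketched as a simple map merge in which later entries win (illustrative only; the keys and values below are hypothetical, and Spark's real merge happens inside `SparkConf` and the launcher):

```scala
// Illustrative precedence merge, lowest-precedence source first:
// spark-defaults.conf < spark-submit flags < SparkConf set in the application.
val defaultsFile = Map(
  "spark.master"          -> "spark://5.6.7.8:7077",
  "spark.executor.memory" -> "2g")
val submitFlags = Map(
  "spark.executor.memory" -> "4g")  // e.g. --conf spark.executor.memory=4g
val appSparkConf = Map(
  "spark.app.name"        -> "CountingSheep")

// Scala's ++ keeps the right-hand value on key collisions, so later
// (higher-precedence) sources override earlier ones.
val effective = defaultsFile ++ submitFlags ++ appSparkConf
// "spark.executor.memory" resolves to "4g": the flag overrides the defaults file.
```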
The application web UI at `http://<driver>:4040` lists Spark properties in the "Environment" tab. This is a useful place to check to make sure that your properties have been set correctly. Note that only values explicitly specified through `spark-defaults.conf`, `SparkConf`, or the command line will appear. For all other configuration properties, you can assume the default value is used.
Most of the properties that control internal settings have reasonable default values. Some of the most common options to set are:
<br /><em>Note:</em> In client mode, this config must not be set through the <code>SparkConf</code> directly in your application, because the driver JVM has already started at that point. Instead, please set this through the <code>--driver-memory</code> command line option or in your default properties file.
NOTE: In Spark 1.0 and later this will be overridden by SPARK_LOCAL_DIRS (Standalone, Mesos) or LOCAL_DIRS (YARN) environment variables set by the cluster manager.
Apart from these, the following properties are also available, and may be useful in some situations:
<br /><em>Note:</em> In client mode, this config must not be set through the <code>SparkConf</code> directly in your application, because the driver JVM has already started at that point. Instead, please set this through the <code>--driver-class-path</code> command line option or in your default properties file.</td>
<br /><em>Note:</em> In client mode, this config must not be set through the <code>SparkConf</code> directly in your application, because the driver JVM has already started at that point. Instead, please set this through the <code>--driver-java-options</code> command line option or in your default properties file.</td>
<br /><em>Note:</em> In client mode, this config must not be set through the <code>SparkConf</code> directly in your application, because the driver JVM has already started at that point. Instead, please set this through the <code>--driver-library-path</code> command line option or in your default properties file.</td>
This is used in cluster mode only.
By default the <code>pyspark.profiler.BasicProfiler</code> will be used, but this can be overridden by passing a profiler class in as a parameter to the <code>SparkContext</code> constructor.
In standalone mode, setting this parameter allows an application to run multiple executors on the same worker, provided that there are enough cores on that worker. Otherwise, only one executor per application will run on each worker.
<p>All the SSL settings like <code>spark.ssl.xxx</code>, where <code>xxx</code> is a particular configuration property, denote the global configuration for all the supported protocols. In order to override the global configuration for a particular protocol, the properties must be overwritten in the protocol-specific namespace.</p> <p>Use <code>spark.ssl.YYY.XXX</code> settings to overwrite the global configuration for the particular protocol denoted by <code>YYY</code>. Currently <code>YYY</code> can be either <code>akka</code> for Akka based connections or <code>fs</code> for broadcast and file server.</p> </td> </tr>
<tr>
  <td><code>spark.ssl.enabledAlgorithms</code></td>
  <td>Empty</td>
  <td>
    A comma-separated list of ciphers. The specified ciphers must be supported by the JVM. A reference list can be found on <a href="https://blogs.oracle.com/java-platform-group/entry/diagnosing_tls_ssl_and_https">this</a> page.
  </td>
</tr>
<tr>
  <td><code>spark.ssl.keyPassword</code></td>
  <td>None</td>
  <td>
    A password to the private key in the key-store.
  </td>
</tr>
<tr>
  <td><code>spark.ssl.keyStore</code></td>
  <td>None</td>
  <td>
    A path to a key-store file. The path can be absolute or relative to the directory in which the component is started.
  </td>
</tr>
<tr>
  <td><code>spark.ssl.keyStorePassword</code></td>
  <td>None</td>
  <td>
    A password to the key-store.
  </td>
</tr>
<tr>
  <td><code>spark.ssl.protocol</code></td>
  <td>None</td>
  <td>
    A protocol name. The protocol must be supported by the JVM. A reference list of protocols can be found on <a href="https://blogs.oracle.com/java-platform-group/entry/diagnosing_tls_ssl_and_https">this</a> page.
  </td>
</tr>
<tr>
  <td><code>spark.ssl.trustStore</code></td>
  <td>None</td>
  <td>
    A path to a trust-store file. The path can be absolute or relative to the directory in which the component is started.
  </td>
</tr>
<tr>
  <td><code>spark.ssl.trustStorePassword</code></td>
  <td>None</td>
  <td>
    A password to the trust-store.
  </td>
</tr>
Each cluster manager in Spark has additional configuration options. Configurations can be found on the pages for each mode:
Certain Spark settings can be configured through environment variables, which are read from the `conf/spark-env.sh` script in the directory where Spark is installed (or `conf/spark-env.cmd` on Windows). In Standalone and Mesos modes, this file can give machine-specific information such as hostnames. It is also sourced when running local Spark applications or submission scripts.
Note that `conf/spark-env.sh` does not exist by default when Spark is installed. However, you can copy `conf/spark-env.sh.template` to create it. Make sure you make the copy executable.
The following variables can be set in `spark-env.sh`:
In addition to the above, there are also options for setting up the Spark standalone cluster scripts, such as number of cores to use on each machine and maximum memory.
Since `spark-env.sh` is a shell script, some of these can be set programmatically -- for example, you might compute `SPARK_LOCAL_IP` by looking up the IP of a specific network interface.
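For instance, a `spark-env.sh` fragment along these lines would derive the address at launch time (a sketch; the `hostname -I` approach is Linux-specific and an assumption, not the only way to look up an interface address):

```shell
# Hypothetical spark-env.sh fragment: derive SPARK_LOCAL_IP from the first
# address reported by hostname -I. Swap in an interface-specific lookup
# (e.g. via the ip tool) if the machine has several addresses.
SPARK_LOCAL_IP="$(hostname -I 2>/dev/null | awk '{print $1}')"
export SPARK_LOCAL_IP
```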
Spark uses log4j for logging. You can configure it by adding a `log4j.properties` file in the `conf` directory. One way to start is to copy the existing `log4j.properties.template` located there.
To specify a configuration directory other than the default `SPARK_HOME/conf`, you can set `SPARK_CONF_DIR`. Spark will use the configuration files (`spark-defaults.conf`, `spark-env.sh`, `log4j.properties`, etc) from this directory.
If you plan to read and write from HDFS using Spark, there are two Hadoop configuration files that should be included on Spark's classpath:

* `hdfs-site.xml`, which provides default behaviors for the HDFS client.
* `core-site.xml`, which sets the default filesystem name.

The location of these configuration files varies across CDH and HDP versions, but a common location is inside of `/etc/hadoop/conf`. Some tools, such as Cloudera Manager, create configurations on-the-fly, but offer a mechanism to download copies of them.
To make these files visible to Spark, set `HADOOP_CONF_DIR` in `$SPARK_HOME/spark-env.sh` to a location containing the configuration files.
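For example, assuming the common `/etc/hadoop/conf` location mentioned above (adjust the path to wherever your distribution keeps `hdfs-site.xml` and `core-site.xml`):

```shell
# In spark-env.sh: point Spark at the directory holding the Hadoop client
# configuration files so they land on Spark's classpath.
export HADOOP_CONF_DIR=/etc/hadoop/conf
```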