| --- |
| layout: global |
| displayTitle: Using Spark's "Hadoop Free" Build |
| title: Using Spark's "Hadoop Free" Build |
| license: | |
| Licensed to the Apache Software Foundation (ASF) under one or more |
| contributor license agreements. See the NOTICE file distributed with |
| this work for additional information regarding copyright ownership. |
| The ASF licenses this file to You under the Apache License, Version 2.0 |
| (the "License"); you may not use this file except in compliance with |
| the License. You may obtain a copy of the License at |
| |
| http://www.apache.org/licenses/LICENSE-2.0 |
| |
| Unless required by applicable law or agreed to in writing, software |
| distributed under the License is distributed on an "AS IS" BASIS, |
| WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. |
| See the License for the specific language governing permissions and |
| limitations under the License. |
| --- |
| |
Spark uses Hadoop client libraries for HDFS and YARN. Starting with Spark 1.4, the project packages "Hadoop free" builds that let you more easily connect a single Spark binary to any Hadoop version. To use these builds, you need to modify `SPARK_DIST_CLASSPATH` to include Hadoop's package jars. The most convenient place to do this is by adding an entry in `conf/spark-env.sh`.
| |
| This page describes how to connect Spark to Hadoop for different types of distributions. |
| |
| # Apache Hadoop |
For Apache distributions, you can use Hadoop's `classpath` command. For instance:
| |
| {% highlight bash %} |
| ### in conf/spark-env.sh ### |
| |
# If the 'hadoop' binary is on your PATH
export SPARK_DIST_CLASSPATH=$(hadoop classpath)

# With an explicit path to the 'hadoop' binary
export SPARK_DIST_CLASSPATH=$(/path/to/hadoop/bin/hadoop classpath)

# Passing a Hadoop configuration directory
export SPARK_DIST_CLASSPATH=$(hadoop --config /path/to/configs classpath)
| |
| {% endhighlight %} |
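
To confirm that the variable resolves as expected, you can source `conf/spark-env.sh` and inspect the result. A minimal sanity check, assuming a hypothetical install path:

{% highlight bash %}
# Hypothetical path; adjust to your Spark install location
source /path/to/spark/conf/spark-env.sh

# Print one classpath entry per line; expect Hadoop directories and
# wildcard entries such as /opt/hadoop/share/hadoop/common/*
echo "$SPARK_DIST_CLASSPATH" | tr ':' '\n'
{% endhighlight %}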
| |
| # Hadoop Free Build Setup for Spark on Kubernetes |
To run the Hadoop free build of Spark on Kubernetes, the executor image must contain the appropriate version of the Hadoop binaries and set the correct `SPARK_DIST_CLASSPATH` value. See the example below for the relevant changes needed in the executor Dockerfile:
| |
| {% highlight bash %} |
### Set environment variables in the executor Dockerfile ###
| |
| ENV SPARK_HOME="/opt/spark" |
| ENV HADOOP_HOME="/opt/hadoop" |
| ENV PATH="$SPARK_HOME/bin:$HADOOP_HOME/bin:$PATH" |
| ... |
| |
# Copy your target Hadoop binaries to the executor's Hadoop home
| |
| COPY /opt/hadoop3 $HADOOP_HOME |
| ... |
| |
# Copy and use the entrypoint.sh provided by Spark. It sets
# SPARK_DIST_CLASSPATH using the hadoop binary in $HADOOP_HOME and starts
# the executor. If you customize the value of SPARK_DIST_CLASSPATH here,
# entrypoint.sh will retain that value instead.
| |
| ENTRYPOINT [ "/opt/entrypoint.sh" ] |
| ... |
| {% endhighlight %} |
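
Once the image is built, it can be used like any other Spark image on Kubernetes. A minimal sketch of the build-and-submit flow, with hypothetical registry, image, and API server placeholders (see the [Running Spark on Kubernetes](running-on-kubernetes.html) page for the full set of options):

{% highlight bash %}
# Build and push the executor image (names are placeholders)
docker build -t <registry>/spark-hadoop-free:latest .
docker push <registry>/spark-hadoop-free:latest

# Point Spark at the custom image when submitting to the cluster
./bin/spark-submit \
  --master k8s://https://<k8s-apiserver-host>:<port> \
  --deploy-mode cluster \
  --conf spark.kubernetes.container.image=<registry>/spark-hadoop-free:latest \
  ...
{% endhighlight %}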