Spark's support for Hadoop InputFormat allows it to process data in OpenStack Swift using the same URI formats as in Hadoop. You can specify a path in Swift as input through a URI of the form `swift://container.PROVIDER/path`. You will also need to set your Swift security credentials, through `core-site.xml` or via `SparkContext.hadoopConfiguration`. The current Swift driver requires Swift to use the Keystone authentication method, or its Rackspace-specific predecessor.
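For example, once the credentials described below are configured, reading an object from Swift looks like any other Hadoop-compatible read. This is a minimal sketch of a standalone application; the container name `mycontainer`, the provider name `SparkTest`, and the object path are hypothetical:

{% highlight scala %}
import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setAppName("SwiftExample"))

// Read an object from the container "mycontainer" registered under the
// provider name "SparkTest"; the swift:// scheme is served by hadoop-openstack.
val data = sc.textFile("swift://mycontainer.SparkTest/data.txt")
println(data.count())
{% endhighlight %}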
Although not mandatory, it is recommended to configure the Swift proxy server with the `list_endpoints` middleware for better data locality. More information is available in the OpenStack Swift documentation for `list_endpoints`.
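As a rough sketch of what enabling it involves (consult the Swift documentation for the authoritative configuration), the middleware is added to the proxy server's pipeline in `proxy-server.conf` and declared as a filter:

{% highlight ini %}
# Excerpt from /etc/swift/proxy-server.conf; the other pipeline entries shown
# here are placeholders and depend on the actual deployment.
[pipeline:main]
pipeline = catch_errors cache list_endpoints proxy-server

[filter:list_endpoints]
use = egg:swift#list_endpoints
{% endhighlight %}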
The Spark application should include the `hadoop-openstack` dependency, which can be done by including the `hadoop-cloud` module for the specific version of Spark used. For example, for Maven support, add the following to the `pom.xml` file:
{% highlight xml %}
...
<dependency>
  <groupId>org.apache.spark</groupId>
  <artifactId>hadoop-cloud_2.11</artifactId>
  <version>${spark.version}</version>
</dependency>
...
{% endhighlight %}
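If the project is built with sbt instead of Maven, the equivalent dependency would be a sketch along these lines (assuming the `hadoop-cloud_2.11` artifact is published for the Spark version in use):

{% highlight scala %}
// build.sbt; sparkVersion is a placeholder and should match the Spark version in use.
val sparkVersion = "2.4.0"  // hypothetical version
// "%%" appends the Scala binary version (e.g. _2.11) to the artifact name.
libraryDependencies += "org.apache.spark" %% "hadoop-cloud" % sparkVersion
{% endhighlight %}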
Create `core-site.xml` and place it inside Spark's `conf` directory. The main parameters to configure are the authentication parameters required by Keystone.
The following table contains a list of Keystone mandatory parameters. `PROVIDER` can be any (alphanumeric) name.

| Property Name | Meaning | Required |
| ------------- | ------- | -------- |
| `fs.swift.service.PROVIDER.auth.url` | Keystone authentication URL | Mandatory |
| `fs.swift.service.PROVIDER.auth.endpoint.prefix` | Keystone endpoints prefix | Optional |
| `fs.swift.service.PROVIDER.tenant` | Tenant | Mandatory |
| `fs.swift.service.PROVIDER.username` | Username | Mandatory |
| `fs.swift.service.PROVIDER.password` | Password | Mandatory |
| `fs.swift.service.PROVIDER.http.port` | HTTP port | Mandatory |
| `fs.swift.service.PROVIDER.region` | Keystone region | Mandatory |
| `fs.swift.service.PROVIDER.public` | Indicates whether to use the public (off cloud) or private (in cloud; no transfer fees) endpoints | Mandatory |
For example, assume `PROVIDER=SparkTest` and Keystone contains user `tester` with password `testing` defined for tenant `test`. Then `core-site.xml` should include:
{% highlight xml %}
<configuration>
  <property>
    <name>fs.swift.service.SparkTest.auth.url</name>
    <value>http://127.0.0.1:5000/v2.0/tokens</value>
  </property>
  <property>
    <name>fs.swift.service.SparkTest.auth.endpoint.prefix</name>
    <value>endpoints</value>
  </property>
  <property>
    <name>fs.swift.service.SparkTest.http.port</name>
    <value>8080</value>
  </property>
  <property>
    <name>fs.swift.service.SparkTest.region</name>
    <value>RegionOne</value>
  </property>
  <property>
    <name>fs.swift.service.SparkTest.public</name>
    <value>true</value>
  </property>
  <property>
    <name>fs.swift.service.SparkTest.tenant</name>
    <value>test</value>
  </property>
  <property>
    <name>fs.swift.service.SparkTest.username</name>
    <value>tester</value>
  </property>
  <property>
    <name>fs.swift.service.SparkTest.password</name>
    <value>testing</value>
  </property>
</configuration>
{% endhighlight %}
Notice that `fs.swift.service.PROVIDER.tenant`, `fs.swift.service.PROVIDER.username`, and `fs.swift.service.PROVIDER.password` contain sensitive information, so keeping them in `core-site.xml` is not always a good approach. We suggest keeping those parameters in `core-site.xml` only for testing purposes, when running Spark via `spark-shell`. For job submissions they should be provided via `SparkContext.hadoopConfiguration`, as shown below.
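A minimal sketch of that approach, reusing the example values from the `core-site.xml` above (in a real job the credentials would come from a secure source rather than literals):

{% highlight scala %}
import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setAppName("SwiftJob"))

// Provide the sensitive Keystone parameters at runtime instead of in core-site.xml.
val hadoopConf = sc.hadoopConfiguration
hadoopConf.set("fs.swift.service.SparkTest.tenant", "test")
hadoopConf.set("fs.swift.service.SparkTest.username", "tester")
hadoopConf.set("fs.swift.service.SparkTest.password", "testing")

// The non-sensitive parameters (auth.url, http.port, region, and so on)
// can remain in core-site.xml or be set here in the same way.
{% endhighlight %}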