---
layout: global
displayTitle: Integration with Cloud Infrastructures
title: Integration with Cloud Infrastructures
description: Introduction to cloud storage support in Apache Spark SPARK_VERSION_SHORT
license: |
  Licensed to the Apache Software Foundation (ASF) under one or more
  contributor license agreements. See the NOTICE file distributed with
  this work for additional information regarding copyright ownership.
  The ASF licenses this file to You under the Apache License, Version 2.0
  (the “License”); you may not use this file except in compliance with
  the License. You may obtain a copy of the License at

     http://www.apache.org/licenses/LICENSE-2.0
---
All major cloud providers offer persistent data storage in object stores. These are not classic “POSIX” file systems. In order to store hundreds of petabytes of data without any single points of failure, object stores replace the classic file system directory tree with a simpler model of `object-name => data`. To enable remote access, operations on objects are usually offered as (slow) HTTP REST operations.
Spark can read and write data in object stores through filesystem connectors implemented in Hadoop or provided by the infrastructure suppliers themselves. These connectors make the object stores look almost like file systems, with directories and files and the classic operations on them such as list, delete and rename.
While the stores appear to be filesystems, underneath they are still object stores, and the difference is significant.
They cannot be used as a direct replacement for a cluster filesystem such as HDFS except where this is explicitly stated.
Key differences are:
How does this affect Spark?
For these reasons, it is not always safe to use an object store as a direct destination of queries, or as an intermediate store in a chain of queries. Consult the documentation of the object store and its connector to determine which uses are considered safe.
As of 2021, the object stores of Amazon (S3), Google Cloud (GCS) and Microsoft (Azure Storage, ADLS Gen1, ADLS Gen2) are all consistent.
This means that as soon as a file is written or updated it can be listed, viewed and opened by other processes, and the latest version will be retrieved. Inconsistency was previously a known issue with AWS S3, especially with 404 caching of HEAD requests made before an object was created.
Even so: none of the store connectors provide any guarantees as to how their clients cope with objects which are overwritten while a stream is reading them. Do not assume that the old file can be safely read, nor that there is any bounded time period for changes to become visible, or indeed that the clients will not simply fail if a file being read is overwritten.
For this reason: avoid overwriting files where it is known/likely that other clients will be actively reading them.
Other object stores are inconsistent. This includes OpenStack Swift.
Such stores are not always safe to use as a destination of work; consult each store's specific documentation.
With the relevant libraries on the classpath and Spark configured with valid credentials, objects can be read or written by using their URLs as the path to data. For example, `sparkContext.textFile("s3a://landsat-pds/scene_list.gz")` will create an RDD of the file `scene_list.gz` stored in S3, using the s3a connector.
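As a rough illustration of how such a path is interpreted, the sketch below splits an object store URL into the connector scheme, the bucket, and the object name, using only the Python standard library. The helper name `split_object_url` is hypothetical and is not part of Spark or Hadoop; the connectors perform their own parsing.

```python
from urllib.parse import urlparse


def split_object_url(url):
    """Split an object store URL into (scheme, bucket, object_name).

    Hypothetical helper for illustration only; not a Spark or Hadoop API.
    """
    parsed = urlparse(url)
    # The scheme selects the connector (for example "s3a"),
    # the authority names the bucket or container,
    # and the remaining path is the object name within it.
    return parsed.scheme, parsed.netloc, parsed.path.lstrip("/")


scheme, bucket, name = split_object_url("s3a://landsat-pds/scene_list.gz")
print(scheme, bucket, name)
```

The object name is simply a key in the store: any "directories" in it are part of the name, which is why directory operations have to be emulated by the connector.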
To add the relevant libraries to an application's classpath, include the `hadoop-cloud` module and its dependencies.
In Maven, add the following to the `pom.xml` file, assuming `spark.version` is set to the chosen version of Spark:
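{% highlight xml %}
...
<dependency>
  <groupId>org.apache.spark</groupId>
  <artifactId>hadoop-cloud_{{site.SCALA_BINARY_VERSION}}</artifactId>
  <version>${spark.version}</version>
  <scope>provided</scope>
</dependency>
...
{% endhighlight %}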
Commercial products based on Apache Spark generally set up the classpath for talking to cloud infrastructures themselves, in which case this module may not be needed.
Spark jobs must authenticate with the object stores to access data within them.
`spark-submit` reads the `AWS_ACCESS_KEY_ID`, `AWS_SECRET_ACCESS_KEY` and `AWS_SESSION_TOKEN` environment variables and sets the associated authentication options for the `s3n` and `s3a` connectors to Amazon S3.
Authentication settings may also be placed in the `core-site.xml` file, in `spark-defaults.conf`, or in the `SparkConf` instance used to configure the application's `SparkContext`.

Important: never check authentication secrets into source code repositories, especially public ones.
Consult the Hadoop documentation for the relevant configuration and security options.
Each cloud connector has its own set of configuration parameters; again, consult the relevant documentation.
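The environment-variable route for the `s3a` connector can be sketched as follows. The option names `fs.s3a.access.key`, `fs.s3a.secret.key` and `fs.s3a.session.token` are the S3A connector's credential properties, passed through Spark with the `spark.hadoop.` prefix. The helper `s3a_options_from_env` is hypothetical and only approximates what `spark-submit` does; it is not Spark's own code.

```python
import os

# Environment variable -> S3A option, with the "spark.hadoop." prefix that
# passes a Hadoop option through the Spark configuration.
ENV_TO_OPTION = {
    "AWS_ACCESS_KEY_ID": "spark.hadoop.fs.s3a.access.key",
    "AWS_SECRET_ACCESS_KEY": "spark.hadoop.fs.s3a.secret.key",
    "AWS_SESSION_TOKEN": "spark.hadoop.fs.s3a.session.token",
}


def s3a_options_from_env(environ=None):
    """Return Spark options for whichever AWS variables are set and non-empty.

    Hypothetical helper for illustration; secrets are read from the
    environment rather than written into source code.
    """
    if environ is None:
        environ = os.environ
    return {
        option: environ[name]
        for name, option in ENV_TO_OPTION.items()
        if environ.get(name)
    }
```

The returned dictionary could then be applied to a `SparkConf` key by key. Reading the values at run time keeps the secrets out of the repository, in line with the warning above.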
For object stores whose consistency model means that rename-based commits are safe, use the `FileOutputCommitter` v2 algorithm for performance, or v1 for safety.
spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version 2
This does less renaming at the end of a job than the “version 1” algorithm. As it still uses `rename()` to commit files, it is unsafe to use when the object store does not have consistent metadata/listings.
The committer can also be set to ignore failures when cleaning up temporary files; this reduces the risk that a transient network problem is escalated into a job failure:
spark.hadoop.mapreduce.fileoutputcommitter.cleanup-failures.ignored true
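One way to apply these committer options without editing `spark-defaults.conf` is to pass them on the command line as `--conf key=value` pairs. The sketch below builds such an argument list; the helper `conf_args` and the application file name `my_job.py` are placeholders chosen for illustration.

```python
def conf_args(options):
    """Turn a dict of Spark options into spark-submit --conf arguments."""
    args = []
    for key, value in options.items():
        args += ["--conf", f"{key}={value}"]
    return args


committer_options = {
    "spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version": "2",
    "spark.hadoop.mapreduce.fileoutputcommitter.cleanup-failures.ignored": "true",
}

# "my_job.py" is a placeholder application name.
command = ["spark-submit"] + conf_args(committer_options) + ["my_job.py"]
print(" ".join(command))
```

Keeping the options in one dictionary makes it easy to change the algorithm version for a given store in a single place.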
The original v1 commit algorithm renames the output of successful tasks to a job attempt directory, and then renames all the files in that directory into the final destination during the job commit phase:
spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version 1
The slow performance of mimicked renames on Amazon S3 makes this algorithm very, very slow. The recommended solution is to switch to an S3 “Zero Rename” committer (see below).
For reference, here are the performance and safety characteristics of different stores and connectors when renaming directories:
Store | Connector | Directory Rename Safety | Rename Performance |
---|---|---|---|
Amazon S3 | s3a | Unsafe | O(data) |
Azure Storage | wasb | Safe | O(files) |
Azure Datalake Gen 2 | abfs | Safe | O(1) |
Google Cloud Storage | gs | Mixed | O(files) |

Directories named `"_temporary"` should be cleaned up on a regular basis.

For optimal performance when working with Parquet data, use the following settings:
spark.hadoop.parquet.enable.summary-metadata false
spark.sql.parquet.mergeSchema false
spark.sql.parquet.filterPushdown true
spark.sql.hive.metastorePartitionPruning true
These minimise the amount of data read during queries.
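These Parquet options can also be set programmatically when the session is created. The following is a minimal sketch, assuming PySpark is installed; `build_session` is a hypothetical helper, and the import is deferred so that the option table can be inspected without Spark available.

```python
PARQUET_OPTIONS = {
    "spark.hadoop.parquet.enable.summary-metadata": "false",
    "spark.sql.parquet.mergeSchema": "false",
    "spark.sql.parquet.filterPushdown": "true",
    "spark.sql.hive.metastorePartitionPruning": "true",
}


def build_session(app_name, options):
    """Create a SparkSession with the given options applied.

    PySpark is imported lazily, so defining this function does not
    require Spark to be installed.
    """
    from pyspark.sql import SparkSession

    builder = SparkSession.builder.appName(app_name)
    for key, value in options.items():
        builder = builder.config(key, value)
    return builder.getOrCreate()
```

A call such as `build_session("parquet-query", PARQUET_OPTIONS)` would return a session with the settings in place. Options must be supplied before the session is created for them to take effect reliably.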
For best performance when working with ORC data, use these settings:
spark.sql.orc.filterPushdown true
spark.sql.orc.splits.include.file.footer true
spark.sql.orc.cache.stripe.details.size 10000
spark.sql.hive.metastorePartitionPruning true
Again, these minimise the amount of data read during queries.
Spark Streaming can monitor files added to object stores by creating a `FileInputDStream` to monitor a path in the store through a call to `StreamingContext.textFileStream()`.
The time to scan for new files is proportional to the number of files under the path, not the number of new files, so it can become a slow operation. The size of the window needs to be set to handle this.
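The cost of such a scan can be illustrated with a toy model. This is not how Spark implements file monitoring; it is a simplified sketch in which `listing` stands in for the result of listing every file under the monitored path, and `scan_for_new_files` is a hypothetical function.

```python
def scan_for_new_files(listing, last_scan_time):
    """Return (new_files, entries_examined) for one scan of a path.

    `listing` maps file name -> modification time, standing in for the
    result of listing every file under the monitored path.
    """
    examined = 0
    new_files = []
    for name, modified in listing.items():
        examined += 1  # every entry is inspected, whether old or new
        if modified > last_scan_time:
            new_files.append(name)
    return sorted(new_files), examined


# 1000 files whose modification time equals their index.
listing = {f"part-{i:04d}": i for i in range(1000)}
new_files, examined = scan_for_new_files(listing, last_scan_time=997)
print(len(new_files), examined)
```

Only two files are new in this example, yet all 1000 entries are examined. Keeping the monitored path small, for instance by moving processed files elsewhere, keeps each scan short.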
Files only appear in an object store once they are completely written; there is no need for a workflow of write-then-rename to ensure that files aren't picked up while they are still being written. Applications can write straight to the monitored directory.
Streams should only be checkpointed to a store implementing a fast and atomic `rename()` operation. Otherwise the checkpointing may be slow and potentially unreliable.
As covered earlier, commit-by-rename is dangerous on any object store which exhibits eventual consistency (historically, S3), and is often slower than classic filesystem renames.
Some object store connectors provide custom committers to commit tasks and jobs without using rename. In versions of Spark built with Hadoop 3.1 or later, the S3A connector for AWS S3 provides such a committer.
Instead of writing data to a temporary directory on the store for renaming, these committers write the files to the final destination, but do not issue the final POST command to make a large “multi-part” upload visible. Those operations are postponed until the job commit itself. As a result, task and job commit are much faster, and task failures do not affect the result.
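The delayed-completion idea can be sketched with a toy in-memory model. This is not the S3A committer's code: `ToyStore` and `run_job` are hypothetical names, and real multipart uploads involve parts, retries and persisted commit metadata that are omitted here.

```python
class ToyStore:
    """In-memory stand-in for an object store with multipart uploads."""

    def __init__(self):
        self.visible = {}   # object name -> data that readers can see
        self.pending = {}   # upload id -> (object name, data), not yet visible
        self._next_id = 0

    def start_upload(self, name, data):
        """Upload data for `name` without making it visible; return an id."""
        self._next_id += 1
        self.pending[self._next_id] = (name, data)
        return self._next_id

    def complete_upload(self, upload_id):
        """Make a pending upload visible at its final destination."""
        name, data = self.pending.pop(upload_id)
        self.visible[name] = data

    def abort_upload(self, upload_id):
        """Discard a pending upload; nothing becomes visible."""
        self.pending.pop(upload_id, None)


def run_job(store, tasks):
    """Run tasks given as (object name, data, succeeded) tuples.

    Returns the sorted names of objects visible after job commit.
    """
    to_commit = []
    for name, data, succeeded in tasks:
        upload_id = store.start_upload(name, data)
        if succeeded:
            to_commit.append(upload_id)    # task commit only records the upload
        else:
            store.abort_upload(upload_id)  # a failed task leaves no output
    # Job commit: complete every recorded upload; no rename is involved.
    for upload_id in to_commit:
        store.complete_upload(upload_id)
    return sorted(store.visible)
```

In this model a failed task never produces a visible object, and job commit costs one completion call per file rather than a copy of the data.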
To switch to the S3A committers, use a version of Spark that was built with Hadoop 3.1 or later, and select the committers through the following options.
spark.hadoop.fs.s3a.committer.name directory
spark.sql.sources.commitProtocolClass org.apache.spark.internal.io.cloud.PathOutputCommitProtocol
spark.sql.parquet.output.committer.class org.apache.spark.internal.io.cloud.BindingParquetOutputCommitter
This configuration has been tested with the most common formats supported by Spark.
mydataframe.write.format("parquet").save("s3a://bucket/destination")
More details on these committers can be found in the latest Hadoop documentation.
Note: depending upon the committer used, in-progress statistics may be under-reported with Hadoop versions before 3.3.1.
Here is the documentation on the standard connectors both from Apache and the cloud providers.