<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<!--
| Generated by Apache Maven Doxia at 2021-06-15
| Rendered using Apache Maven Stylus Skin 1.5
-->
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<title>Apache Hadoop Amazon Web Services support &#x2013; Hadoop-AWS module: Integration with Amazon Web Services</title>
<style type="text/css" media="all">
@import url("../../css/maven-base.css");
@import url("../../css/maven-theme.css");
@import url("../../css/site.css");
</style>
<link rel="stylesheet" href="../../css/print.css" type="text/css" media="print" />
<meta name="Date-Revision-yyyymmdd" content="20210615" />
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />
</head>
<body class="composite">
<div id="banner">
<a href="http://hadoop.apache.org/" id="bannerLeft">
<img src="http://hadoop.apache.org/images/hadoop-logo.jpg" alt="" />
</a>
<a href="http://www.apache.org/" id="bannerRight">
<img src="http://www.apache.org/images/asf_logo_wide.png" alt="" />
</a>
<div class="clear">
<hr/>
</div>
</div>
<div id="breadcrumbs">
<div class="xleft">
<a href="http://www.apache.org/" class="externalLink">Apache</a>
&gt;
<a href="http://hadoop.apache.org/" class="externalLink">Hadoop</a>
&gt;
<a href="../../index.html">Apache Hadoop Amazon Web Services support</a>
&gt;
Hadoop-AWS module: Integration with Amazon Web Services
</div>
<div class="xright"> <a href="http://wiki.apache.org/hadoop" class="externalLink">Wiki</a>
|
<a href="https://gitbox.apache.org/repos/asf/hadoop.git" class="externalLink">git</a>
&nbsp;| Last Published: 2021-06-15
&nbsp;| Version: 3.3.1
</div>
<div class="clear">
<hr/>
</div>
</div>
<div id="leftColumn">
<div id="navcolumn">
<h5>General</h5>
<ul>
<li class="none">
<a href="../../../index.html">Overview</a>
</li>
<li class="none">
<a href="../../../hadoop-project-dist/hadoop-common/SingleCluster.html">Single Node Setup</a>
</li>
<li class="none">
<a href="../../../hadoop-project-dist/hadoop-common/ClusterSetup.html">Cluster Setup</a>
</li>
<li class="none">
<a href="../../../hadoop-project-dist/hadoop-common/CommandsManual.html">Commands Reference</a>
</li>
<li class="none">
<a href="../../../hadoop-project-dist/hadoop-common/FileSystemShell.html">FileSystem Shell</a>
</li>
<li class="none">
<a href="../../../hadoop-project-dist/hadoop-common/Compatibility.html">Compatibility Specification</a>
</li>
<li class="none">
<a href="../../../hadoop-project-dist/hadoop-common/DownstreamDev.html">Downstream Developer's Guide</a>
</li>
<li class="none">
<a href="../../../hadoop-project-dist/hadoop-common/AdminCompatibilityGuide.html">Admin Compatibility Guide</a>
</li>
<li class="none">
<a href="../../../hadoop-project-dist/hadoop-common/InterfaceClassification.html">Interface Classification</a>
</li>
<li class="none">
<a href="../../../hadoop-project-dist/hadoop-common/filesystem/index.html">FileSystem Specification</a>
</li>
</ul>
<h5>Common</h5>
<ul>
<li class="none">
<a href="../../../hadoop-project-dist/hadoop-common/CLIMiniCluster.html">CLI Mini Cluster</a>
</li>
<li class="none">
<a href="../../../hadoop-project-dist/hadoop-common/FairCallQueue.html">Fair Call Queue</a>
</li>
<li class="none">
<a href="../../../hadoop-project-dist/hadoop-common/NativeLibraries.html">Native Libraries</a>
</li>
<li class="none">
<a href="../../../hadoop-project-dist/hadoop-common/Superusers.html">Proxy User</a>
</li>
<li class="none">
<a href="../../../hadoop-project-dist/hadoop-common/RackAwareness.html">Rack Awareness</a>
</li>
<li class="none">
<a href="../../../hadoop-project-dist/hadoop-common/SecureMode.html">Secure Mode</a>
</li>
<li class="none">
<a href="../../../hadoop-project-dist/hadoop-common/ServiceLevelAuth.html">Service Level Authorization</a>
</li>
<li class="none">
<a href="../../../hadoop-project-dist/hadoop-common/HttpAuthentication.html">HTTP Authentication</a>
</li>
<li class="none">
<a href="../../../hadoop-project-dist/hadoop-common/CredentialProviderAPI.html">Credential Provider API</a>
</li>
<li class="none">
<a href="../../../hadoop-kms/index.html">Hadoop KMS</a>
</li>
<li class="none">
<a href="../../../hadoop-project-dist/hadoop-common/Tracing.html">Tracing</a>
</li>
<li class="none">
<a href="../../../hadoop-project-dist/hadoop-common/UnixShellGuide.html">Unix Shell Guide</a>
</li>
<li class="none">
<a href="../../../hadoop-project-dist/hadoop-common/registry/index.html">Registry</a>
</li>
</ul>
<h5>HDFS</h5>
<ul>
<li class="none">
<a href="../../../hadoop-project-dist/hadoop-hdfs/HdfsDesign.html">Architecture</a>
</li>
<li class="none">
<a href="../../../hadoop-project-dist/hadoop-hdfs/HdfsUserGuide.html">User Guide</a>
</li>
<li class="none">
<a href="../../../hadoop-project-dist/hadoop-hdfs/HDFSCommands.html">Commands Reference</a>
</li>
<li class="none">
<a href="../../../hadoop-project-dist/hadoop-hdfs/HDFSHighAvailabilityWithQJM.html">NameNode HA With QJM</a>
</li>
<li class="none">
<a href="../../../hadoop-project-dist/hadoop-hdfs/HDFSHighAvailabilityWithNFS.html">NameNode HA With NFS</a>
</li>
<li class="none">
<a href="../../../hadoop-project-dist/hadoop-hdfs/ObserverNameNode.html">Observer NameNode</a>
</li>
<li class="none">
<a href="../../../hadoop-project-dist/hadoop-hdfs/Federation.html">Federation</a>
</li>
<li class="none">
<a href="../../../hadoop-project-dist/hadoop-hdfs/ViewFs.html">ViewFs</a>
</li>
<li class="none">
<a href="../../../hadoop-project-dist/hadoop-hdfs/ViewFsOverloadScheme.html">ViewFsOverloadScheme</a>
</li>
<li class="none">
<a href="../../../hadoop-project-dist/hadoop-hdfs/HdfsSnapshots.html">Snapshots</a>
</li>
<li class="none">
<a href="../../../hadoop-project-dist/hadoop-hdfs/HdfsEditsViewer.html">Edits Viewer</a>
</li>
<li class="none">
<a href="../../../hadoop-project-dist/hadoop-hdfs/HdfsImageViewer.html">Image Viewer</a>
</li>
<li class="none">
<a href="../../../hadoop-project-dist/hadoop-hdfs/HdfsPermissionsGuide.html">Permissions and HDFS</a>
</li>
<li class="none">
<a href="../../../hadoop-project-dist/hadoop-hdfs/HdfsQuotaAdminGuide.html">Quotas and HDFS</a>
</li>
<li class="none">
<a href="../../../hadoop-project-dist/hadoop-hdfs/LibHdfs.html">libhdfs (C API)</a>
</li>
<li class="none">
<a href="../../../hadoop-project-dist/hadoop-hdfs/WebHDFS.html">WebHDFS (REST API)</a>
</li>
<li class="none">
<a href="../../../hadoop-hdfs-httpfs/index.html">HttpFS</a>
</li>
<li class="none">
<a href="../../../hadoop-project-dist/hadoop-hdfs/ShortCircuitLocalReads.html">Short Circuit Local Reads</a>
</li>
<li class="none">
<a href="../../../hadoop-project-dist/hadoop-hdfs/CentralizedCacheManagement.html">Centralized Cache Management</a>
</li>
<li class="none">
<a href="../../../hadoop-project-dist/hadoop-hdfs/HdfsNfsGateway.html">NFS Gateway</a>
</li>
<li class="none">
<a href="../../../hadoop-project-dist/hadoop-hdfs/HdfsRollingUpgrade.html">Rolling Upgrade</a>
</li>
<li class="none">
<a href="../../../hadoop-project-dist/hadoop-hdfs/ExtendedAttributes.html">Extended Attributes</a>
</li>
<li class="none">
<a href="../../../hadoop-project-dist/hadoop-hdfs/TransparentEncryption.html">Transparent Encryption</a>
</li>
<li class="none">
<a href="../../../hadoop-project-dist/hadoop-hdfs/HdfsMultihoming.html">Multihoming</a>
</li>
<li class="none">
<a href="../../../hadoop-project-dist/hadoop-hdfs/ArchivalStorage.html">Storage Policies</a>
</li>
<li class="none">
<a href="../../../hadoop-project-dist/hadoop-hdfs/MemoryStorage.html">Memory Storage Support</a>
</li>
<li class="none">
<a href="../../../hadoop-project-dist/hadoop-hdfs/SLGUserGuide.html">Synthetic Load Generator</a>
</li>
<li class="none">
<a href="../../../hadoop-project-dist/hadoop-hdfs/HDFSErasureCoding.html">Erasure Coding</a>
</li>
<li class="none">
<a href="../../../hadoop-project-dist/hadoop-hdfs/HDFSDiskbalancer.html">Disk Balancer</a>
</li>
<li class="none">
<a href="../../../hadoop-project-dist/hadoop-hdfs/HdfsUpgradeDomain.html">Upgrade Domain</a>
</li>
<li class="none">
<a href="../../../hadoop-project-dist/hadoop-hdfs/HdfsDataNodeAdminGuide.html">DataNode Admin</a>
</li>
<li class="none">
<a href="../../../hadoop-project-dist/hadoop-hdfs-rbf/HDFSRouterFederation.html">Router Federation</a>
</li>
<li class="none">
<a href="../../../hadoop-project-dist/hadoop-hdfs/HdfsProvidedStorage.html">Provided Storage</a>
</li>
</ul>
<h5>MapReduce</h5>
<ul>
<li class="none">
<a href="../../../hadoop-mapreduce-client/hadoop-mapreduce-client-core/MapReduceTutorial.html">Tutorial</a>
</li>
<li class="none">
<a href="../../../hadoop-mapreduce-client/hadoop-mapreduce-client-core/MapredCommands.html">Commands Reference</a>
</li>
<li class="none">
<a href="../../../hadoop-mapreduce-client/hadoop-mapreduce-client-core/MapReduce_Compatibility_Hadoop1_Hadoop2.html">Compatibility with 1.x</a>
</li>
<li class="none">
<a href="../../../hadoop-mapreduce-client/hadoop-mapreduce-client-core/EncryptedShuffle.html">Encrypted Shuffle</a>
</li>
<li class="none">
<a href="../../../hadoop-mapreduce-client/hadoop-mapreduce-client-core/PluggableShuffleAndPluggableSort.html">Pluggable Shuffle/Sort</a>
</li>
<li class="none">
<a href="../../../hadoop-mapreduce-client/hadoop-mapreduce-client-core/DistributedCacheDeploy.html">Distributed Cache Deploy</a>
</li>
<li class="none">
<a href="../../../hadoop-mapreduce-client/hadoop-mapreduce-client-core/SharedCacheSupport.html">Support for YARN Shared Cache</a>
</li>
</ul>
<h5>MapReduce REST APIs</h5>
<ul>
<li class="none">
<a href="../../../hadoop-mapreduce-client/hadoop-mapreduce-client-core/MapredAppMasterRest.html">MR Application Master</a>
</li>
<li class="none">
<a href="../../../hadoop-mapreduce-client/hadoop-mapreduce-client-hs/HistoryServerRest.html">MR History Server</a>
</li>
</ul>
<h5>YARN</h5>
<ul>
<li class="none">
<a href="../../../hadoop-yarn/hadoop-yarn-site/YARN.html">Architecture</a>
</li>
<li class="none">
<a href="../../../hadoop-yarn/hadoop-yarn-site/YarnCommands.html">Commands Reference</a>
</li>
<li class="none">
<a href="../../../hadoop-yarn/hadoop-yarn-site/CapacityScheduler.html">Capacity Scheduler</a>
</li>
<li class="none">
<a href="../../../hadoop-yarn/hadoop-yarn-site/FairScheduler.html">Fair Scheduler</a>
</li>
<li class="none">
<a href="../../../hadoop-yarn/hadoop-yarn-site/ResourceManagerRestart.html">ResourceManager Restart</a>
</li>
<li class="none">
<a href="../../../hadoop-yarn/hadoop-yarn-site/ResourceManagerHA.html">ResourceManager HA</a>
</li>
<li class="none">
<a href="../../../hadoop-yarn/hadoop-yarn-site/ResourceModel.html">Resource Model</a>
</li>
<li class="none">
<a href="../../../hadoop-yarn/hadoop-yarn-site/NodeLabel.html">Node Labels</a>
</li>
<li class="none">
<a href="../../../hadoop-yarn/hadoop-yarn-site/NodeAttributes.html">Node Attributes</a>
</li>
<li class="none">
<a href="../../../hadoop-yarn/hadoop-yarn-site/WebApplicationProxy.html">Web Application Proxy</a>
</li>
<li class="none">
<a href="../../../hadoop-yarn/hadoop-yarn-site/TimelineServer.html">Timeline Server</a>
</li>
<li class="none">
<a href="../../../hadoop-yarn/hadoop-yarn-site/TimelineServiceV2.html">Timeline Service V.2</a>
</li>
<li class="none">
<a href="../../../hadoop-yarn/hadoop-yarn-site/WritingYarnApplications.html">Writing YARN Applications</a>
</li>
<li class="none">
<a href="../../../hadoop-yarn/hadoop-yarn-site/YarnApplicationSecurity.html">YARN Application Security</a>
</li>
<li class="none">
<a href="../../../hadoop-yarn/hadoop-yarn-site/NodeManager.html">NodeManager</a>
</li>
<li class="none">
<a href="../../../hadoop-yarn/hadoop-yarn-site/DockerContainers.html">Running Applications in Docker Containers</a>
</li>
<li class="none">
<a href="../../../hadoop-yarn/hadoop-yarn-site/RuncContainers.html">Running Applications in runC Containers</a>
</li>
<li class="none">
<a href="../../../hadoop-yarn/hadoop-yarn-site/NodeManagerCgroups.html">Using CGroups</a>
</li>
<li class="none">
<a href="../../../hadoop-yarn/hadoop-yarn-site/SecureContainer.html">Secure Containers</a>
</li>
<li class="none">
<a href="../../../hadoop-yarn/hadoop-yarn-site/ReservationSystem.html">Reservation System</a>
</li>
<li class="none">
<a href="../../../hadoop-yarn/hadoop-yarn-site/GracefulDecommission.html">Graceful Decommission</a>
</li>
<li class="none">
<a href="../../../hadoop-yarn/hadoop-yarn-site/OpportunisticContainers.html">Opportunistic Containers</a>
</li>
<li class="none">
<a href="../../../hadoop-yarn/hadoop-yarn-site/Federation.html">YARN Federation</a>
</li>
<li class="none">
<a href="../../../hadoop-yarn/hadoop-yarn-site/SharedCache.html">Shared Cache</a>
</li>
<li class="none">
<a href="../../../hadoop-yarn/hadoop-yarn-site/UsingGpus.html">Using GPU</a>
</li>
<li class="none">
<a href="../../../hadoop-yarn/hadoop-yarn-site/UsingFPGA.html">Using FPGA</a>
</li>
<li class="none">
<a href="../../../hadoop-yarn/hadoop-yarn-site/PlacementConstraints.html">Placement Constraints</a>
</li>
<li class="none">
<a href="../../../hadoop-yarn/hadoop-yarn-site/YarnUI2.html">YARN UI2</a>
</li>
</ul>
<h5>YARN REST APIs</h5>
<ul>
<li class="none">
<a href="../../../hadoop-yarn/hadoop-yarn-site/WebServicesIntro.html">Introduction</a>
</li>
<li class="none">
<a href="../../../hadoop-yarn/hadoop-yarn-site/ResourceManagerRest.html">Resource Manager</a>
</li>
<li class="none">
<a href="../../../hadoop-yarn/hadoop-yarn-site/NodeManagerRest.html">Node Manager</a>
</li>
<li class="none">
<a href="../../../hadoop-yarn/hadoop-yarn-site/TimelineServer.html#Timeline_Server_REST_API_v1">Timeline Server</a>
</li>
<li class="none">
<a href="../../../hadoop-yarn/hadoop-yarn-site/TimelineServiceV2.html#Timeline_Service_v.2_REST_API">Timeline Service V.2</a>
</li>
</ul>
<h5>YARN Service</h5>
<ul>
<li class="none">
<a href="../../../hadoop-yarn/hadoop-yarn-site/yarn-service/Overview.html">Overview</a>
</li>
<li class="none">
<a href="../../../hadoop-yarn/hadoop-yarn-site/yarn-service/QuickStart.html">QuickStart</a>
</li>
<li class="none">
<a href="../../../hadoop-yarn/hadoop-yarn-site/yarn-service/Concepts.html">Concepts</a>
</li>
<li class="none">
<a href="../../../hadoop-yarn/hadoop-yarn-site/yarn-service/YarnServiceAPI.html">Yarn Service API</a>
</li>
<li class="none">
<a href="../../../hadoop-yarn/hadoop-yarn-site/yarn-service/ServiceDiscovery.html">Service Discovery</a>
</li>
<li class="none">
<a href="../../../hadoop-yarn/hadoop-yarn-site/yarn-service/SystemServices.html">System Services</a>
</li>
</ul>
<h5>Hadoop Compatible File Systems</h5>
<ul>
<li class="none">
<a href="../../../hadoop-aliyun/tools/hadoop-aliyun/index.html">Aliyun OSS</a>
</li>
<li class="none">
<a href="../../../hadoop-aws/tools/hadoop-aws/index.html">Amazon S3</a>
</li>
<li class="none">
<a href="../../../hadoop-azure/index.html">Azure Blob Storage</a>
</li>
<li class="none">
<a href="../../../hadoop-azure-datalake/index.html">Azure Data Lake Storage</a>
</li>
<li class="none">
<a href="../../../hadoop-openstack/index.html">OpenStack Swift</a>
</li>
<li class="none">
<a href="../../../hadoop-cos/cloud-storage/index.html">Tencent COS</a>
</li>
</ul>
<h5>Auth</h5>
<ul>
<li class="none">
<a href="../../../hadoop-auth/index.html">Overview</a>
</li>
<li class="none">
<a href="../../../hadoop-auth/Examples.html">Examples</a>
</li>
<li class="none">
<a href="../../../hadoop-auth/Configuration.html">Configuration</a>
</li>
<li class="none">
<a href="../../../hadoop-auth/BuildingIt.html">Building</a>
</li>
</ul>
<h5>Tools</h5>
<ul>
<li class="none">
<a href="../../../hadoop-streaming/HadoopStreaming.html">Hadoop Streaming</a>
</li>
<li class="none">
<a href="../../../hadoop-archives/HadoopArchives.html">Hadoop Archives</a>
</li>
<li class="none">
<a href="../../../hadoop-archive-logs/HadoopArchiveLogs.html">Hadoop Archive Logs</a>
</li>
<li class="none">
<a href="../../../hadoop-distcp/DistCp.html">DistCp</a>
</li>
<li class="none">
<a href="../../../hadoop-gridmix/GridMix.html">GridMix</a>
</li>
<li class="none">
<a href="../../../hadoop-rumen/Rumen.html">Rumen</a>
</li>
<li class="none">
<a href="../../../hadoop-resourceestimator/ResourceEstimator.html">Resource Estimator Service</a>
</li>
<li class="none">
<a href="../../../hadoop-sls/SchedulerLoadSimulator.html">Scheduler Load Simulator</a>
</li>
<li class="none">
<a href="../../../hadoop-project-dist/hadoop-common/Benchmarking.html">Hadoop Benchmarking</a>
</li>
<li class="none">
<a href="../../../hadoop-dynamometer/Dynamometer.html">Dynamometer</a>
</li>
</ul>
<h5>Reference</h5>
<ul>
<li class="none">
<a href="../../../hadoop-project-dist/hadoop-common/release/">Changelog and Release Notes</a>
</li>
<li class="none">
<a href="../../../api/index.html">Java API docs</a>
</li>
<li class="none">
<a href="../../../hadoop-project-dist/hadoop-common/UnixShellAPI.html">Unix Shell API</a>
</li>
<li class="none">
<a href="../../../hadoop-project-dist/hadoop-common/Metrics.html">Metrics</a>
</li>
</ul>
<h5>Configuration</h5>
<ul>
<li class="none">
<a href="../../../hadoop-project-dist/hadoop-common/core-default.xml">core-default.xml</a>
</li>
<li class="none">
<a href="../../../hadoop-project-dist/hadoop-hdfs/hdfs-default.xml">hdfs-default.xml</a>
</li>
<li class="none">
<a href="../../../hadoop-project-dist/hadoop-hdfs-rbf/hdfs-rbf-default.xml">hdfs-rbf-default.xml</a>
</li>
<li class="none">
<a href="../../../hadoop-mapreduce-client/hadoop-mapreduce-client-core/mapred-default.xml">mapred-default.xml</a>
</li>
<li class="none">
<a href="../../../hadoop-yarn/hadoop-yarn-common/yarn-default.xml">yarn-default.xml</a>
</li>
<li class="none">
<a href="../../../hadoop-kms/kms-default.html">kms-default.xml</a>
</li>
<li class="none">
<a href="../../../hadoop-hdfs-httpfs/httpfs-default.html">httpfs-default.xml</a>
</li>
<li class="none">
<a href="../../../hadoop-project-dist/hadoop-common/DeprecatedProperties.html">Deprecated Properties</a>
</li>
</ul>
<a href="http://maven.apache.org/" title="Built by Maven" class="poweredBy">
<img alt="Built by Maven" src="../../images/logos/maven-feather.png"/>
</a>
</div>
</div>
<div id="bodyColumn">
<div id="contentBox">
<!---
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License. See accompanying LICENSE file.
-->
<h1>Hadoop-AWS module: Integration with Amazon Web Services</h1>
<ul>
<li><a href="#Compatibility"> Compatibility</a>
<ul>
<li><a href="#Directory_Marker_Compatibility"> Directory Marker Compatibility</a></li></ul></li>
<li><a href="#Documents"> Documents</a></li>
<li><a href="#Overview"> Overview</a></li>
<li><a href="#Introducing_the_Hadoop_S3A_client."> Introducing the Hadoop S3A client.</a>
<ul>
<li><a href="#Other_S3_Connectors">Other S3 Connectors</a></li></ul></li>
<li><a href="#Getting_Started"> Getting Started</a></li>
<li><a href="#Warnings"> Warnings</a>
<ul>
<li><a href="#Warning_.231:_S3_Consistency_model">Warning #1: S3 Consistency model</a></li>
<li><a href="#Warning_.232:_Directories_are_mimicked">Warning #2: Directories are mimicked</a></li>
<li><a href="#Warning_.233:_Object_stores_have_different_authorization_models">Warning #3: Object stores have different authorization models</a></li>
<li><a href="#Warning_.234:_Your_AWS_credentials_are_very.2C_very_valuable">Warning #4: Your AWS credentials are very, very valuable</a></li></ul></li>
<li><a href="#Authenticating_with_S3"> Authenticating with S3</a>
<ul>
<li><a href="#Authentication_properties">Authentication properties</a></li>
<li><a href="#Authenticating_via_the_AWS_Environment_Variables"> Authenticating via the AWS Environment Variables</a></li>
<li><a href="#Changing_Authentication_Providers"> Changing Authentication Providers</a></li>
<li><a href="#EC2_IAM_Metadata_Authentication_with_InstanceProfileCredentialsProvider"> EC2 IAM Metadata Authentication with InstanceProfileCredentialsProvider</a></li>
<li><a href="#Using_Named_Profile_Credentials_with_ProfileCredentialsProvider"> Using Named Profile Credentials with ProfileCredentialsProvider</a></li>
<li><a href="#Using_Session_Credentials_with_TemporaryAWSCredentialsProvider"> Using Session Credentials with TemporaryAWSCredentialsProvider</a></li>
<li><a href="#Anonymous_Login_with_AnonymousAWSCredentialsProvider"> Anonymous Login with AnonymousAWSCredentialsProvider</a></li>
<li><a href="#Simple_name.2Fsecret_credentials_with_SimpleAWSCredentialsProvider.2A"> Simple name/secret credentials with SimpleAWSCredentialsProvider*</a></li></ul></li>
<li><a href="#Protecting_the_AWS_Credentials"> Protecting the AWS Credentials</a></li>
<li><a href="#Storing_secrets_with_Hadoop_Credential_Providers">Storing secrets with Hadoop Credential Providers</a>
<ul>
<li><a href="#Step_1:_Create_a_credential_file">Step 1: Create a credential file</a></li>
<li><a href="#Step_2:_Configure_the_hadoop.security.credential.provider.path_property">Step 2: Configure the hadoop.security.credential.provider.path property</a></li>
<li><a href="#Using_secrets_from_credential_providers">Using secrets from credential providers</a></li></ul></li>
<li><a href="#General_S3A_Client_configuration">General S3A Client configuration</a></li>
<li><a href="#Retry_and_Recovery">Retry and Recovery</a>
<ul>
<li><a href="#Unrecoverable_Problems:_Fail_Fast">Unrecoverable Problems: Fail Fast</a></li>
<li><a href="#Possibly_Recoverable_Problems:_Retry">Possibly Recoverable Problems: Retry</a></li>
<li><a href="#Only_retriable_on_idempotent_operations">Only retriable on idempotent operations</a></li>
<li><a href="#Throttled_requests_from_S3_and_Dynamo_DB">Throttled requests from S3 and Dynamo DB</a></li></ul></li>
<li><a href="#Handling_Read-During-Overwrite">Handling Read-During-Overwrite</a>
<ul>
<li><a href="#Change_detection_with_S3_Versions.">Change detection with S3 Versions.</a></li>
<li><a href="#Change_Detection_Modes.">Change Detection Modes.</a></li></ul></li>
<li><a href="#Configuring_different_S3_buckets_with_Per-Bucket_Configuration">Configuring different S3 buckets with Per-Bucket Configuration</a>
<ul>
<li><a href="#Customizing_S3A_secrets_held_in_credential_files">Customizing S3A secrets held in credential files</a></li>
<li><a href="#Using_Per-Bucket_Configuration_to_access_data_round_the_world">Using Per-Bucket Configuration to access data round the world</a></li></ul></li>
<li><a href="#How_S3A_writes_data_to_S3">How S3A writes data to S3</a>
<ul>
<li><a href="#Buffering_upload_data_on_disk_fs.s3a.fast.upload.buffer.3Ddisk">Buffering upload data on disk fs.s3a.fast.upload.buffer=disk</a></li>
<li><a href="#Buffering_upload_data_in_ByteBuffers:_fs.s3a.fast.upload.buffer.3Dbytebuffer">Buffering upload data in ByteBuffers: fs.s3a.fast.upload.buffer=bytebuffer</a></li>
<li><a href="#Buffering_upload_data_in_byte_arrays:_fs.s3a.fast.upload.buffer.3Darray">Buffering upload data in byte arrays: fs.s3a.fast.upload.buffer=array</a></li>
<li><a href="#Upload_Thread_Tuning">Upload Thread Tuning</a></li>
<li><a href="#Cleaning_up_after_partial_Upload_Failures">Cleaning up after partial Upload Failures</a></li>
<li><a href="#S3A_.E2.80.9Cfadvise.E2.80.9D_input_policy_support">S3A &#x201c;fadvise&#x201d; input policy support</a></li></ul></li>
<li><a href="#Metrics">Metrics</a></li>
<li><a href="#Other_Topics"> Other Topics</a>
<ul>
<li><a href="#Copying_Data_with_distcp"> Copying Data with distcp</a></li>
<li><a href="#Advanced_-_Custom_Signers"> Advanced - Custom Signers</a></li></ul></li></ul>
<div class="section">
<h2><a name="Compatibility"></a><a name="compatibility"></a> Compatibility</h2>
<div class="section">
<h3><a name="Directory_Marker_Compatibility"></a><a name="directory-marker-compatibility"></a> Directory Marker Compatibility</h3>
<ol style="list-style-type: decimal">
<li>
<p>This release can safely list/index/read S3 buckets where &#x201c;empty directory&#x201d; markers are retained.</p>
</li>
<li>
<p>This release can be configured to retain these directory markers at the expense of being backwards incompatible; see the example below.</p>
</li>
</ol>
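<p>As a minimal sketch of the second point, marker retention is selected through a single configuration option; consult the linked document for the authoritative set of supported values before enabling it.</p>
<div>
<div>
<pre class="source">&lt;property&gt;
  &lt;name&gt;fs.s3a.directory.marker.retention&lt;/name&gt;
  &lt;value&gt;keep&lt;/value&gt;
&lt;/property&gt;
</pre></div></div>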
<p>Consult <a href="directory_markers.html">Controlling the S3A Directory Marker Behavior</a> for full details.</p></div></div>
<div class="section">
<h2><a name="Documents"></a><a name="documents"></a> Documents</h2>
<ul>
<li><a href="./encryption.html">Encryption</a></li>
<li><a href="./performance.html">Performance</a></li>
<li><a href="./s3guard.html">S3Guard</a></li>
<li><a href="./troubleshooting_s3a.html">Troubleshooting</a></li>
<li><a href="directory_markers.html">Controlling the S3A Directory Marker Behavior</a>.</li>
<li><a href="./committers.html">Committing work to S3 with the &#x201c;S3A Committers&#x201d;</a></li>
<li><a href="./committer_architecture.html">S3A Committers Architecture</a></li>
<li><a href="./assumed_roles.html">Working with IAM Assumed Roles</a></li>
<li><a href="./delegation_tokens.html">S3A Delegation Token Support</a></li>
<li><a href="delegation_token_architecture.html">S3A Delegation Token Architecture</a>.</li>
<li><a href="./testing.html">Testing</a></li>
</ul></div>
<div class="section">
<h2><a name="Overview"></a><a name="overview"></a> Overview</h2>
<p>Apache Hadoop&#x2019;s <tt>hadoop-aws</tt> module provides support for AWS integration, allowing applications to easily use this support.</p>
<p>To include the S3A client in Apache Hadoop&#x2019;s default classpath:</p>
<ol style="list-style-type: decimal">
<li>
<p>Make sure that <tt>HADOOP_OPTIONAL_TOOLS</tt> in <tt>hadoop-env.sh</tt> includes <tt>hadoop-aws</tt> in its list of optional modules to add to the classpath (see the example below).</p>
</li>
<li>
<p>For client side interaction, you can declare that relevant JARs must be loaded in your <tt>~/.hadooprc</tt> file:</p>
<div>
<div>
<pre class="source">hadoop_add_to_classpath_tools hadoop-aws
</pre></div></div>
</li>
</ol>
<p>The settings in this file do not propagate to deployed applications, but they will work for local clients such as the <tt>hadoop fs</tt> command.</p></div>
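<p>For example, a <tt>hadoop-env.sh</tt> entry enabling the module might look like the following sketch; merge <tt>hadoop-aws</tt> with whatever other optional modules your deployment already lists.</p>
<div>
<div>
<pre class="source"># hadoop-env.sh: add hadoop-aws to the optional modules placed on the classpath
export HADOOP_OPTIONAL_TOOLS="hadoop-aws"
</pre></div></div>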
<div class="section">
<h2><a name="Introducing_the_Hadoop_S3A_client."></a><a name="introduction"></a> Introducing the Hadoop S3A client.</h2>
<p>Hadoop&#x2019;s &#x201c;S3A&#x201d; client offers high-performance IO against Amazon S3 object store and compatible implementations.</p>
<ul>
<li>Directly reads and writes S3 objects.</li>
<li>Compatible with standard S3 clients.</li>
<li>Compatible with files created by the older <tt>s3n://</tt> client and Amazon EMR&#x2019;s <tt>s3://</tt> client.</li>
<li>Supports partitioned uploads for many-GB objects.</li>
<li>Offers a high-performance random IO mode for working with columnar data such as Apache ORC and Apache Parquet files.</li>
<li>Uses Amazon&#x2019;s Java S3 SDK with support for latest S3 features and authentication schemes.</li>
<li>Supports authentication via: environment variables, Hadoop configuration properties, the Hadoop key management store and IAM roles.</li>
<li>Supports per-bucket configuration.</li>
<li>Supports S3 &#x201c;Server Side Encryption&#x201d; for both reading and writing: SSE-S3, SSE-KMS and SSE-C</li>
<li>Instrumented with Hadoop metrics.</li>
<li>Before S3 was consistent, provided a consistent view of inconsistent storage through <a href="./s3guard.html">S3Guard</a>.</li>
<li>
<p>Actively maintained by the open source community.</p>
</li>
</ul>
<div class="section">
<h3><a name="Other_S3_Connectors"></a>Other S3 Connectors</h3>
<p>There are other Hadoop connectors to S3. Only S3A is actively maintained by the Hadoop project itself.</p>
<ol style="list-style-type: decimal">
<li>Apache Hadoop&#x2019;s original <tt>s3://</tt> client. This is no longer included in Hadoop.</li>
<li>Amazon EMR&#x2019;s <tt>s3://</tt> client. This is from the Amazon EMR team, who actively maintain it.</li>
<li>Apache Hadoop&#x2019;s <a href="./s3n.html"><tt>s3n:</tt> filesystem client</a>. This connector is no longer available: users must migrate to the newer <tt>s3a:</tt> client.</li>
</ol></div></div>
<div class="section">
<h2><a name="Getting_Started"></a><a name="getting_started"></a> Getting Started</h2>
<p>S3A depends upon two JARs, alongside <tt>hadoop-common</tt> and its dependencies.</p>
<ul>
<li><tt>hadoop-aws</tt> JAR.</li>
<li><tt>aws-java-sdk-bundle</tt> JAR.</li>
</ul>
<p>The versions of <tt>hadoop-common</tt> and <tt>hadoop-aws</tt> must be identical.</p>
<p>To import the libraries into a Maven build, add the <tt>hadoop-aws</tt> JAR to the build dependencies; it will pull in a compatible aws-sdk JAR.</p>
<p>The <tt>hadoop-aws</tt> JAR <i>does not</i> declare any dependencies other than the one unique to it, the AWS SDK JAR. This is to simplify excluding/tuning Hadoop dependency JARs in downstream applications. The <tt>hadoop-client</tt> or <tt>hadoop-common</tt> dependency must be declared explicitly:</p>
<div>
<div>
<pre class="source">&lt;properties&gt;
&lt;!-- Your exact Hadoop version here--&gt;
&lt;hadoop.version&gt;3.0.0&lt;/hadoop.version&gt;
&lt;/properties&gt;
&lt;dependencies&gt;
&lt;dependency&gt;
&lt;groupId&gt;org.apache.hadoop&lt;/groupId&gt;
&lt;artifactId&gt;hadoop-client&lt;/artifactId&gt;
&lt;version&gt;${hadoop.version}&lt;/version&gt;
&lt;/dependency&gt;
&lt;dependency&gt;
&lt;groupId&gt;org.apache.hadoop&lt;/groupId&gt;
&lt;artifactId&gt;hadoop-aws&lt;/artifactId&gt;
&lt;version&gt;${hadoop.version}&lt;/version&gt;
&lt;/dependency&gt;
&lt;/dependencies&gt;
</pre></div></div>
</div>
<div class="section">
<h2><a name="Warnings"></a><a name="warning"></a> Warnings</h2>
<p>Amazon S3 is an example of &#x201c;an object store&#x201d;. In order to achieve scalability and especially high availability, S3 has &#x2014;as many other cloud object stores have done&#x2014; relaxed some of the constraints which classic &#x201c;POSIX&#x201d; filesystems promise.</p>
<p>The <a href="./s3guard.html">S3Guard</a> feature attempts to address some of these, but it cannot do so completely. Do read these warnings and consider how they apply.</p>
<p>For further discussion on these topics, please consult <a href="../../../hadoop-project-dist/hadoop-common/filesystem/index.html">The Hadoop FileSystem API Definition</a>.</p>
<div class="section">
<h3><a name="Warning_.231:_S3_Consistency_model"></a>Warning #1: S3 Consistency model</h3>
<p>Amazon S3 is an example of &#x201c;an object store&#x201d;. In order to achieve scalability and especially high availability, S3 has &#x2014;as many other cloud object stores have done&#x2014; relaxed some of the constraints which classic &#x201c;POSIX&#x201d; filesystems promise.</p>
<p>Specifically</p>
<ol style="list-style-type: decimal">
<li>Files that are newly created from the Hadoop Filesystem APIs may not be immediately visible.</li>
<li>File delete and update operations may not immediately propagate. Old copies of the file may exist for an indeterminate time period.</li>
<li>Directory operations: <tt>delete()</tt> and <tt>rename()</tt> are implemented by recursive file-by-file operations. They take time at least proportional to the number of files, during which time partial updates may be visible. If the operations are interrupted, the filesystem is left in an intermediate state.</li>
</ol></div>
<div class="section">
<h3><a name="Warning_.232:_Directories_are_mimicked"></a>Warning #2: Directories are mimicked</h3>
<p>The S3A client mimics directories by:</p>
<ol style="list-style-type: decimal">
<li>Creating a stub entry after a <tt>mkdirs</tt> call, deleting it when a file is added anywhere underneath</li>
<li>When listing a directory, searching for all objects whose path starts with the directory path, and returning them as the listing.</li>
<li>When renaming a directory, taking such a listing and asking S3 to copy the individual objects to new objects with the destination filenames.</li>
<li>When deleting a directory, taking such a listing and deleting the entries in batches.</li>
<li>When renaming or deleting directories, taking such a listing and working on the individual files.</li>
</ol>
<p>Here are some of the consequences:</p>
<ul>
<li>Directories may lack modification times. Parts of Hadoop relying on this can have unexpected behaviour. E.g. the <tt>AggregatedLogDeletionService</tt> of YARN will not remove the appropriate logfiles.</li>
<li>Directory listing can be slow. Use <tt>listFiles(path, recursive)</tt> for high performance recursive listings whenever possible.</li>
<li>It is possible to create files under files if the caller tries hard.</li>
<li>The time to rename a directory is proportional to the number of files underneath it (directly or indirectly) and the size of the files. (The copy is executed inside the S3 storage, so the time is independent of the bandwidth from client to S3).</li>
<li>Directory renames are not atomic: they can fail partway through, and callers cannot safely rely on atomic renames as part of a commit algorithm.</li>
<li>Directory deletion is not atomic and can fail partway through.</li>
</ul>
<p>The final three issues surface when using S3 as the immediate destination of work, as opposed to HDFS or another &#x201c;real&#x201d; filesystem.</p>
<p>The <a href="./committers.html">S3A committers</a> are the sole mechanism available to safely save the output of queries directly into S3 object stores through the S3A filesystem.</p></div>
<div class="section">
<h3><a name="Warning_.233:_Object_stores_have_different_authorization_models"></a>Warning #3: Object stores have different authorization models</h3>
<p>The object authorization model of S3 is much different from the file authorization model of HDFS and traditional file systems. The S3A client simply reports stub information from APIs that would query this metadata:</p>
<ul>
<li>File owner is reported as the current user.</li>
<li>File group also is reported as the current user. Prior to Apache Hadoop 2.8.0, file group was reported as empty (no group associated), which is a potential incompatibility problem for scripts that perform positional parsing of shell output and other clients that expect to find a well-defined group.</li>
<li>Directory permissions are reported as 777.</li>
<li>File permissions are reported as 666.</li>
</ul>
<p>S3A does not really enforce any authorization checks on these stub permissions. Users authenticate to an S3 bucket using AWS credentials. It&#x2019;s possible that object ACLs have been defined to enforce authorization at the S3 side, but this happens entirely within the S3 service, not within the S3A implementation.</p></div>
<div class="section">
<h3><a name="Warning_.234:_Your_AWS_credentials_are_very.2C_very_valuable"></a>Warning #4: Your AWS credentials are very, very valuable</h3>
<p>Your AWS credentials not only pay for services, they offer read and write access to the data. Anyone with the credentials can not only read your datasets &#x2014;they can delete them.</p>
<p>Do not inadvertently share these credentials through means such as:</p>
<ol style="list-style-type: decimal">
<li>Checking in to SCM any configuration files containing the secrets.</li>
<li>Logging them to a console, as they invariably end up being seen.</li>
<li>Including the secrets in bug reports.</li>
<li>Logging the <tt>AWS_</tt> environment variables.</li>
</ol>
<p>If you do any of these: change your credentials immediately!</p></div></div>
<div class="section">
<h2><a name="Authenticating_with_S3"></a><a name="authenticating"></a> Authenticating with S3</h2>
<p>Except when interacting with public S3 buckets, the S3A client needs credentials to interact with buckets.</p>
<p>The client supports multiple authentication mechanisms and can be configured as to which mechanisms to use, and their order of use. Custom implementations of <tt>com.amazonaws.auth.AWSCredentialsProvider</tt> may also be used.</p>
<p><i>Important</i>: The S3A connector no longer supports username and secrets in URLs of the form <tt>s3a://key:secret@bucket/</tt>. It is near-impossible to stop those secrets being logged &#x2014;which is why a warning has been printed since Hadoop 2.8 whenever such a URL was used.</p>
<div class="section">
<h3><a name="Authentication_properties"></a>Authentication properties</h3>
<div>
<div>
<pre class="source">&lt;property&gt;
&lt;name&gt;fs.s3a.access.key&lt;/name&gt;
&lt;description&gt;AWS access key ID.
Omit for IAM role-based or provider-based authentication.&lt;/description&gt;
&lt;/property&gt;
&lt;property&gt;
&lt;name&gt;fs.s3a.secret.key&lt;/name&gt;
&lt;description&gt;AWS secret key.
Omit for IAM role-based or provider-based authentication.&lt;/description&gt;
&lt;/property&gt;
&lt;property&gt;
&lt;name&gt;fs.s3a.aws.credentials.provider&lt;/name&gt;
&lt;description&gt;
Comma-separated class names of credential provider classes which implement
com.amazonaws.auth.AWSCredentialsProvider.
These are loaded and queried in sequence for a valid set of credentials.
Each listed class must implement one of the following means of
construction, which are attempted in order:
1. a public constructor accepting java.net.URI and
org.apache.hadoop.conf.Configuration,
2. a public static method named getInstance that accepts no
arguments and returns an instance of
com.amazonaws.auth.AWSCredentialsProvider, or
3. a public default constructor.
Specifying org.apache.hadoop.fs.s3a.AnonymousAWSCredentialsProvider allows
anonymous access to a publicly accessible S3 bucket without any credentials.
Please note that allowing anonymous access to an S3 bucket compromises
security and therefore is unsuitable for most use cases. It can be useful
for accessing public data sets without requiring AWS credentials.
If unspecified, then the default list of credential provider classes,
queried in sequence, is:
1. org.apache.hadoop.fs.s3a.SimpleAWSCredentialsProvider:
Uses the values of fs.s3a.access.key and fs.s3a.secret.key.
2. com.amazonaws.auth.EnvironmentVariableCredentialsProvider: supports
configuration of AWS access key ID and secret access key in
environment variables named AWS_ACCESS_KEY_ID and
AWS_SECRET_ACCESS_KEY, as documented in the AWS SDK.
3. com.amazonaws.auth.InstanceProfileCredentialsProvider: supports use
of instance profile credentials if running in an EC2 VM.
&lt;/description&gt;
&lt;/property&gt;
&lt;property&gt;
&lt;name&gt;fs.s3a.session.token&lt;/name&gt;
&lt;description&gt;
Session token, when using org.apache.hadoop.fs.s3a.TemporaryAWSCredentialsProvider
as one of the providers.
&lt;/description&gt;
&lt;/property&gt;
</pre></div></div>
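<p>As a minimal sketch, the long-lived access and secret keys can be set directly in the configuration; the values below are placeholders. Storing these secrets in a Hadoop credential provider, as covered later in this document, is the more secure option.</p>
<div>
<div>
<pre class="source">&lt;property&gt;
  &lt;name&gt;fs.s3a.access.key&lt;/name&gt;
  &lt;value&gt;ACCESS-KEY-ID&lt;/value&gt;
&lt;/property&gt;

&lt;property&gt;
  &lt;name&gt;fs.s3a.secret.key&lt;/name&gt;
  &lt;value&gt;SECRET-ACCESS-KEY&lt;/value&gt;
&lt;/property&gt;
</pre></div></div>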
</div>
<div class="section">
<h3><a name="Authenticating_via_the_AWS_Environment_Variables"></a><a name="auth_env_vars"></a> Authenticating via the AWS Environment Variables</h3>
<p>S3A supports configuration via <a class="externalLink" href="http://docs.aws.amazon.com/cli/latest/userguide/cli-chap-getting-started.html#cli-environment">the standard AWS environment variables</a>.</p>
<p>The core environment variables are for the access key and associated secret:</p>
<div>
<div>
<pre class="source">export AWS_ACCESS_KEY_ID=my.aws.key
export AWS_SECRET_ACCESS_KEY=my.secret.key
</pre></div></div>
<p>If the environment variable <tt>AWS_SESSION_TOKEN</tt> is set, session authentication using &#x201c;Temporary Security Credentials&#x201d; is enabled; the Key ID and secret key must be set to the credentials for that specific session.</p>
<div>
<div>
<pre class="source">export AWS_SESSION_TOKEN=SECRET-SESSION-TOKEN
export AWS_ACCESS_KEY_ID=SESSION-ACCESS-KEY
export AWS_SECRET_ACCESS_KEY=SESSION-SECRET-KEY
</pre></div></div>
<p>These environment variables can be used to set the authentication credentials instead of properties in the Hadoop configuration.</p>
<p><i>Important:</i> These environment variables are generally not propagated from client to server when YARN applications are launched. That is: having the AWS environment variables set when an application is launched will not permit the launched application to access S3 resources. The environment variables must (somehow) be set on the hosts/processes where the work is executed.</p></div>
<div class="section">
<h3><a name="Changing_Authentication_Providers"></a><a name="auth_providers"></a> Changing Authentication Providers</h3>
<p>The standard way to authenticate is with an access key and secret key set in the Hadoop configuration files.</p>
<p>By default, the S3A client follows the following authentication chain:</p>
<ol style="list-style-type: decimal">
<li>The options <tt>fs.s3a.access.key</tt>, <tt>fs.s3a.secret.key</tt> and <tt>fs.s3a.session.token</tt> are looked for in the Hadoop XML configuration/Hadoop credential providers, returning a set of session credentials if all three are defined.</li>
<li>The <tt>fs.s3a.access.key</tt> and <tt>fs.s3a.secret.key</tt> are looked for in the Hadoop XML configuration/Hadoop credential providers, returning a set of long-lived credentials if they are defined.</li>
<li>The <a class="externalLink" href="http://docs.aws.amazon.com/cli/latest/userguide/cli-chap-getting-started.html#cli-environment">AWS environment variables</a>, are then looked for: these will return session or full credentials depending on which values are set.</li>
<li>An attempt is made to query the Amazon EC2 Instance Metadata Service to retrieve credentials published to EC2 VMs.</li>
</ol>
<p>S3A can be configured to obtain client authentication providers from classes which integrate with the AWS SDK by implementing the <tt>com.amazonaws.auth.AWSCredentialsProvider</tt> Interface. This is done by listing the implementation classes, in order of preference, in the configuration option <tt>fs.s3a.aws.credentials.provider</tt>.</p>
<p><i>Important</i>: AWS Credential Providers are distinct from <i>Hadoop Credential Providers</i>. As will be covered later, Hadoop Credential Providers allow passwords and other secrets to be stored and transferred more securely than in XML configuration files. AWS Credential Providers are classes which can be used by the Amazon AWS SDK to obtain an AWS login from a different source in the system, including environment variables, JVM properties and configuration files.</p>
<p>The Hadoop <tt>fs.s3a.</tt> options used to store login details can all be secured in <a href="../../../hadoop-project-dist/hadoop-common/CredentialProviderAPI.html">Hadoop credential providers</a>; this is advised as a more secure way to store valuable secrets.</p>
<p>There are a number of AWS Credential Providers inside the <tt>hadoop-aws</tt> JAR:</p>
<table border="0" class="bodyTable">
<thead>
<tr class="a">
<th> classname </th>
<th> description </th></tr>
</thead><tbody>
<tr class="b">
<td> <tt>org.apache.hadoop.fs.s3a.TemporaryAWSCredentialsProvider</tt></td>
<td> Session Credentials </td></tr>
<tr class="a">
<td> <tt>org.apache.hadoop.fs.s3a.SimpleAWSCredentialsProvider</tt></td>
<td> Simple name/secret credentials </td></tr>
<tr class="b">
<td> <tt>org.apache.hadoop.fs.s3a.AnonymousAWSCredentialsProvider</tt></td>
<td> Anonymous Login </td></tr>
<tr class="a">
<td> <tt>org.apache.hadoop.fs.s3a.auth.AssumedRoleCredentialProvider</tt></td>
<td> <a href="assumed_roles.html">Assumed Role credentials</a> </td></tr>
</tbody>
</table>
<p>There are also many in the Amazon SDKs, in particular two which are automatically set up in the authentication chain:</p>
<table border="0" class="bodyTable">
<thead>
<tr class="a">
<th> classname </th>
<th> description </th></tr>
</thead><tbody>
<tr class="b">
<td> <tt>com.amazonaws.auth.InstanceProfileCredentialsProvider</tt></td>
<td> EC2 Metadata Credentials </td></tr>
<tr class="a">
<td> <tt>com.amazonaws.auth.EnvironmentVariableCredentialsProvider</tt></td>
<td> AWS Environment Variables </td></tr>
</tbody>
</table></div>
<div class="section">
<h3><a name="EC2_IAM_Metadata_Authentication_with_InstanceProfileCredentialsProvider"></a><a name="auth_iam"></a> EC2 IAM Metadata Authentication with <tt>InstanceProfileCredentialsProvider</tt></h3>
<p>Applications running in EC2 may associate an IAM role with the VM and query the <a class="externalLink" href="http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/ec2-instance-metadata.html">EC2 Instance Metadata Service</a> for credentials to access S3. Within the AWS SDK, this functionality is provided by <tt>InstanceProfileCredentialsProvider</tt>, which internally enforces a singleton instance in order to prevent throttling problems.</p></div>
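<p>A minimal sketch of a configuration which restricts S3A to this mechanism, assuming the VM has an appropriate IAM role attached:</p>
<div>
<div>
<pre class="source">&lt;property&gt;
  &lt;name&gt;fs.s3a.aws.credentials.provider&lt;/name&gt;
  &lt;value&gt;com.amazonaws.auth.InstanceProfileCredentialsProvider&lt;/value&gt;
&lt;/property&gt;
</pre></div></div>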
<div class="section">
<h3><a name="Using_Named_Profile_Credentials_with_ProfileCredentialsProvider"></a><a name="auth_named_profile"></a> Using Named Profile Credentials with <tt>ProfileCredentialsProvider</tt></h3>
<p>You can configure Hadoop to authenticate to AWS using a <a class="externalLink" href="https://docs.aws.amazon.com/cli/latest/userguide/cli-configure-profiles.html">named profile</a>.</p>
<p>To authenticate with a named profile:</p>
<ol style="list-style-type: decimal">
<li>Declare <tt>com.amazonaws.auth.profile.ProfileCredentialsProvider</tt> as the provider (see the sketch after this list).</li>
<li>Set your profile via the <tt>AWS_PROFILE</tt> environment variable.</li>
<li>Due to a <a class="externalLink" href="https://github.com/aws/aws-sdk-java/issues/803">bug in version 1 of the AWS Java SDK</a>, you&#x2019;ll need to remove the <tt>profile</tt> prefix from the AWS configuration section heading.
<p>Here&#x2019;s an example of what your AWS configuration files should look like:</p>
<div>
<div>
<pre class="source">$ cat ~/.aws/config
[user1]
region = us-east-1
$ cat ~/.aws/credentials
[user1]
aws_access_key_id = ...
aws_secret_access_key = ...
aws_session_token = ...
aws_security_token = ...
</pre></div></div>
</li>
</ol></div>
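<p>A sketch of the corresponding Hadoop configuration; the profile itself is selected through the <tt>AWS_PROFILE</tt> environment variable as described above.</p>
<div>
<div>
<pre class="source">&lt;property&gt;
  &lt;name&gt;fs.s3a.aws.credentials.provider&lt;/name&gt;
  &lt;value&gt;com.amazonaws.auth.profile.ProfileCredentialsProvider&lt;/value&gt;
&lt;/property&gt;
</pre></div></div>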
<div class="section">
<h3><a name="Using_Session_Credentials_with_TemporaryAWSCredentialsProvider"></a><a name="auth_session"></a> Using Session Credentials with <tt>TemporaryAWSCredentialsProvider</tt></h3>
<p><a class="externalLink" href="http://docs.aws.amazon.com/IAM/latest/UserGuide/id_credentials_temp.html">Temporary Security Credentials</a> can be obtained from the Amazon Security Token Service; these consist of an access key, a secret key, and a session token.</p>
<p>To authenticate with these:</p>
<ol style="list-style-type: decimal">
<li>Declare <tt>org.apache.hadoop.fs.s3a.TemporaryAWSCredentialsProvider</tt> as the provider.</li>
<li>Set the session key in the property <tt>fs.s3a.session.token</tt>, and the access and secret key properties to those of this temporary session.</li>
</ol>
<p>Example:</p>
<div>
<div>
<pre class="source">&lt;property&gt;
&lt;name&gt;fs.s3a.aws.credentials.provider&lt;/name&gt;
&lt;value&gt;org.apache.hadoop.fs.s3a.TemporaryAWSCredentialsProvider&lt;/value&gt;
&lt;/property&gt;
&lt;property&gt;
&lt;name&gt;fs.s3a.access.key&lt;/name&gt;
&lt;value&gt;SESSION-ACCESS-KEY&lt;/value&gt;
&lt;/property&gt;
&lt;property&gt;
&lt;name&gt;fs.s3a.secret.key&lt;/name&gt;
&lt;value&gt;SESSION-SECRET-KEY&lt;/value&gt;
&lt;/property&gt;
&lt;property&gt;
&lt;name&gt;fs.s3a.session.token&lt;/name&gt;
&lt;value&gt;SECRET-SESSION-TOKEN&lt;/value&gt;
&lt;/property&gt;
</pre></div></div>
<p>The lifetime of session credentials is fixed when the credentials are issued; once they expire, the application will no longer be able to authenticate to AWS.</p></div>
<div class="section">
<h3><a name="Anonymous_Login_with_AnonymousAWSCredentialsProvider"></a><a name="auth_anon"></a> Anonymous Login with <tt>AnonymousAWSCredentialsProvider</tt></h3>
<p>Specifying <tt>org.apache.hadoop.fs.s3a.AnonymousAWSCredentialsProvider</tt> allows anonymous access to a publicly accessible S3 bucket without any credentials. It can be useful for accessing public data sets without requiring AWS credentials.</p>
<div>
<div>
<pre class="source">&lt;property&gt;
&lt;name&gt;fs.s3a.aws.credentials.provider&lt;/name&gt;
&lt;value&gt;org.apache.hadoop.fs.s3a.AnonymousAWSCredentialsProvider&lt;/value&gt;
&lt;/property&gt;
</pre></div></div>
<p>Once this is done, there&#x2019;s no need to supply any credentials in the Hadoop configuration or via environment variables.</p>
<p>This option can be used to verify that an object store does not permit unauthenticated access: that is, if an attempt to list a bucket is made using the anonymous credentials, it should fail &#x2014;unless explicitly opened up for broader access.</p>
<div>
<div>
<pre class="source">hadoop fs -ls \
-D fs.s3a.aws.credentials.provider=org.apache.hadoop.fs.s3a.AnonymousAWSCredentialsProvider \
s3a://landsat-pds/
</pre></div></div>
<ol style="list-style-type: decimal">
<li>
<p>Allowing anonymous access to an S3 bucket compromises security and therefore is unsuitable for most use cases.</p>
</li>
<li>
<p>If a list of credential providers is given in <tt>fs.s3a.aws.credentials.provider</tt>, then the Anonymous Credential provider <i>must</i> come last. If not, credential providers listed after it will be ignored.</p>
</li>
</ol></div>
<div class="section">
<h3><a name="Simple_name.2Fsecret_credentials_with_SimpleAWSCredentialsProvider.2A"></a><a name="auth_simple"></a> Simple name/secret credentials with <tt>SimpleAWSCredentialsProvider</tt>*</h3>
<p>This is the standard credential provider, which reads the access key from <tt>fs.s3a.access.key</tt> and the secret key from <tt>fs.s3a.secret.key</tt>.</p>
<div>
<div>
<pre class="source">&lt;property&gt;
&lt;name&gt;fs.s3a.aws.credentials.provider&lt;/name&gt;
&lt;value&gt;org.apache.hadoop.fs.s3a.SimpleAWSCredentialsProvider&lt;/value&gt;
&lt;/property&gt;
</pre></div></div>
<p>This is the basic authenticator used in the default authentication chain.</p>
<p>This means that the default S3A authentication chain can be defined as</p>
<div>
<div>
<pre class="source">&lt;property&gt;
&lt;name&gt;fs.s3a.aws.credentials.provider&lt;/name&gt;
&lt;value&gt;
org.apache.hadoop.fs.s3a.TemporaryAWSCredentialsProvider,
org.apache.hadoop.fs.s3a.SimpleAWSCredentialsProvider,
com.amazonaws.auth.EnvironmentVariableCredentialsProvider,
org.apache.hadoop.fs.s3a.auth.IAMInstanceCredentialsProvider
&lt;/value&gt;
&lt;description&gt;
Comma-separated class names of credential provider classes which implement
com.amazonaws.auth.AWSCredentialsProvider.
When S3A delegation tokens are not enabled, this list will be used
to directly authenticate with S3 and DynamoDB services.
When S3A Delegation tokens are enabled, depending upon the delegation
token binding it may be used
to communicate with the STS endpoint to request session/role
credentials.
These are loaded and queried in sequence for a valid set of credentials.
Each listed class must implement one of the following means of
construction, which are attempted in order:
* a public constructor accepting java.net.URI and
org.apache.hadoop.conf.Configuration,
* a public constructor accepting org.apache.hadoop.conf.Configuration,
* a public static method named getInstance that accepts no
arguments and returns an instance of
com.amazonaws.auth.AWSCredentialsProvider, or
* a public default constructor.
Specifying org.apache.hadoop.fs.s3a.AnonymousAWSCredentialsProvider allows
anonymous access to a publicly accessible S3 bucket without any credentials.
Please note that allowing anonymous access to an S3 bucket compromises
security and therefore is unsuitable for most use cases. It can be useful
for accessing public data sets without requiring AWS credentials.
If unspecified, then the default list of credential provider classes,
queried in sequence, is:
* org.apache.hadoop.fs.s3a.TemporaryAWSCredentialsProvider: looks
for session login secrets in the Hadoop configuration.
* org.apache.hadoop.fs.s3a.SimpleAWSCredentialsProvider:
Uses the values of fs.s3a.access.key and fs.s3a.secret.key.
* com.amazonaws.auth.EnvironmentVariableCredentialsProvider: supports
configuration of AWS access key ID and secret access key in
environment variables named AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY,
and AWS_SESSION_TOKEN as documented in the AWS SDK.
* org.apache.hadoop.fs.s3a.auth.IAMInstanceCredentialsProvider: picks up
IAM credentials of any EC2 VM or AWS container in which the process is running.
&lt;/description&gt;
&lt;/property&gt;
</pre></div></div>
</div></div>
<div class="section">
<h2><a name="Protecting_the_AWS_Credentials"></a><a name="auth_security"></a> Protecting the AWS Credentials</h2>
<p>It is critical that you never share or leak your AWS credentials. Loss of credentials can leak/lose all your data, run up large bills, and significantly damage your organisation.</p>
<ol style="list-style-type: decimal">
<li>
<p>Never share your secrets.</p>
</li>
<li>
<p>Never commit your secrets into an SCM repository. The <a class="externalLink" href="https://github.com/awslabs/git-secrets">git secrets</a> tool can help here.</p>
</li>
<li>
<p>Never include AWS credentials in bug reports, files attached to them, or similar.</p>
</li>
<li>
<p>If you use the <tt>AWS_</tt> environment variables, your list of environment variables is equally sensitive.</p>
</li>
<li>
<p>Never use root credentials. Use IAM user accounts, with each user/application having its own set of credentials.</p>
</li>
<li>
<p>Use IAM permissions to restrict the permissions individual users and applications have. This is best done through roles, rather than configuring individual users.</p>
</li>
<li>
<p>Avoid passing in secrets to Hadoop applications/commands on the command line. The command line of any launched program is visible to all users on a Unix system (via <tt>ps</tt>), and preserved in command histories.</p>
</li>
<li>
<p>Explore using <a href="assumed_roles.html">IAM Assumed Roles</a> for role-based permissions management: a specific S3A connection can be made with a different assumed role and permissions from the primary user account.</p>
</li>
<li>
<p>Consider a workflow in which users and applications are issued with short-lived session credentials, configuring S3A to use these through the <tt>TemporaryAWSCredentialsProvider</tt>.</p>
</li>
<li>
<p>Have a secure process in place for cancelling and re-issuing credentials for users and applications. Test it regularly by using it to refresh credentials.</p>
</li>
<li>
<p>In installations where Kerberos is enabled, <a href="delegation_tokens.html">S3A Delegation Tokens</a> can be used to acquire short-lived session/role credentials and then pass them into the shared application. This can ensure that the long-lived secrets stay on the local system.</p>
</li>
</ol>
<p>When running in EC2, the IAM EC2 instance credential provider will automatically obtain the credentials needed to access AWS services in the role the EC2 VM was deployed as. This AWS credential provider is enabled in S3A by default.</p></div>
<div class="section">
<h2><a name="Storing_secrets_with_Hadoop_Credential_Providers"></a><a name="hadoop_credential_providers"></a>Storing secrets with Hadoop Credential Providers</h2>
<p>The Hadoop Credential Provider Framework allows secure &#x201c;Credential Providers&#x201d; to keep secrets outside Hadoop configuration files, storing them in encrypted files in local or Hadoop filesystems, and including them in requests.</p>
<p>The S3A configuration options with sensitive data (<tt>fs.s3a.secret.key</tt>, <tt>fs.s3a.access.key</tt>, <tt>fs.s3a.session.token</tt> and <tt>fs.s3a.server-side-encryption.key</tt>) can have their data saved to a binary file, with the values being read in when the S3A filesystem URL is used for data access. The reference to this credential provider is then declared in the Hadoop configuration.</p>
<p>For additional reading on the Hadoop Credential Provider API see: <a href="../../../hadoop-project-dist/hadoop-common/CredentialProviderAPI.html">Credential Provider API</a>.</p>
<p>The following configuration options can be stored in Hadoop Credential Provider stores.</p>
<div>
<div>
<pre class="source">fs.s3a.access.key
fs.s3a.secret.key
fs.s3a.session.token
fs.s3a.server-side-encryption.key
fs.s3a.server-side-encryption-algorithm
</pre></div></div>
<p>The first three are for authentication; the final two for <a href="./encryption.html">encryption</a>. Of the latter, only the encryption key can be considered &#x201c;sensitive&#x201d;. However, being able to include the algorithm in the credentials allows for a JCEKS file to contain all the options needed to encrypt new data written to S3.</p>
<div class="section">
<h3><a name="Step_1:_Create_a_credential_file"></a>Step 1: Create a credential file</h3>
<p>A credential file can be created on any Hadoop filesystem; when creating one on HDFS or a Unix filesystem the permissions are automatically set to keep the file private to the reader &#x2014;though as directory permissions are not touched, users should verify that the directory containing the file is readable only by the current user.</p>
<div>
<div>
<pre class="source">hadoop credential create fs.s3a.access.key -value 123 \
-provider jceks://hdfs@nn1.example.com:9001/user/backup/s3.jceks
hadoop credential create fs.s3a.secret.key -value 456 \
-provider jceks://hdfs@nn1.example.com:9001/user/backup/s3.jceks
</pre></div></div>
<p>A credential file can be listed to see what entries are kept inside it:</p>
<div>
<div>
<pre class="source">hadoop credential list -provider jceks://hdfs@nn1.example.com:9001/user/backup/s3.jceks
Listing aliases for CredentialProvider: jceks://hdfs@nn1.example.com:9001/user/backup/s3.jceks
fs.s3a.secret.key
fs.s3a.access.key
</pre></div></div>
<p>At this point, the credentials are ready for use.</p></div>
<div class="section">
<h3><a name="Step_2:_Configure_the_hadoop.security.credential.provider.path_property"></a>Step 2: Configure the <tt>hadoop.security.credential.provider.path</tt> property</h3>
<p>The URL to the provider must be set in the configuration property <tt>hadoop.security.credential.provider.path</tt>, either on the command line or in XML configuration files.</p>
<div>
<div>
<pre class="source">&lt;property&gt;
&lt;name&gt;hadoop.security.credential.provider.path&lt;/name&gt;
&lt;value&gt;jceks://hdfs@nn1.example.com:9001/user/backup/s3.jceks&lt;/value&gt;
&lt;description&gt;Path to interrogate for protected credentials.&lt;/description&gt;
&lt;/property&gt;
</pre></div></div>
<p>Because this property only supplies the path to the secrets file, the configuration option itself is no longer a sensitive item.</p>
<p>The property <tt>hadoop.security.credential.provider.path</tt> is global to all filesystems and secrets. There is another property, <tt>fs.s3a.security.credential.provider.path</tt> which only lists credential providers for S3A filesystems. The two properties are combined into one, with the list of providers in the <tt>fs.s3a.</tt> property taking precedence over that of the <tt>hadoop.security</tt> list (i.e. they are prepended to the common list).</p>
<div>
<div>
<pre class="source">&lt;property&gt;
&lt;name&gt;fs.s3a.security.credential.provider.path&lt;/name&gt;
&lt;value /&gt;
&lt;description&gt;
Optional comma separated list of credential providers, a list
which is prepended to that set in hadoop.security.credential.provider.path
&lt;/description&gt;
&lt;/property&gt;
</pre></div></div>
<p>This was added to support binding different credential providers on a per-bucket basis, without adding alternative secrets in the credential list. However, some applications (e.g. Hive) prevent the list of credential providers from being dynamically updated by users. As per-bucket secrets are now supported, it is better to include per-bucket keys in JCEKS files and other sources of credentials.</p></div>
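<p>For example, per-bucket keys for the <tt>nightly</tt> bucket used in later examples could be added to a JCEKS file; the values here are placeholders, not real credentials:</p>
<div>
<div>
<pre class="source">hadoop credential create fs.s3a.bucket.nightly.access.key -value example-access-key \
-provider jceks://hdfs@nn1.example.com:9001/user/backup/s3.jceks
hadoop credential create fs.s3a.bucket.nightly.secret.key -value example-secret-key \
-provider jceks://hdfs@nn1.example.com:9001/user/backup/s3.jceks
</pre></div></div>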
<div class="section">
<h3><a name="Using_secrets_from_credential_providers"></a>Using secrets from credential providers</h3>
<p>Once the provider is set in the Hadoop configuration, Hadoop commands work exactly as if the secrets were in an XML file.</p>
<div>
<div>
<pre class="source">hadoop distcp \
hdfs://nn1.example.com:9001/user/backup/007020615 s3a://glacier1/
hadoop fs -ls s3a://glacier1/
</pre></div></div>
<p>The path to the provider can also be set on the command line:</p>
<div>
<div>
<pre class="source">hadoop distcp \
-D hadoop.security.credential.provider.path=jceks://hdfs@nn1.example.com:9001/user/backup/s3.jceks \
hdfs://nn1.example.com:9001/user/backup/007020615 s3a://glacier1/
hadoop fs \
-D fs.s3a.security.credential.provider.path=jceks://hdfs@nn1.example.com:9001/user/backup/s3.jceks \
-ls s3a://glacier1/
</pre></div></div>
<p>Because the provider path is not itself a sensitive secret, there is no risk from placing its declaration on the command line.</p></div></div>
<div class="section">
<h2><a name="General_S3A_Client_configuration"></a><a name="general_configuration"></a>General S3A Client configuration</h2>
<p>All S3A client options are configured with the prefix <tt>fs.s3a.</tt>.</p>
<p>The client supports <a href="#per_bucket_configuration">Per-bucket configuration</a> to allow different buckets to override the shared settings. This is commonly used to change the endpoint, encryption and authentication mechanisms of buckets, along with S3Guard options and various minor settings.</p>
<p>Here are the S3A properties for use in production. The S3Guard options are documented in the <a href="./s3guard.html">S3Guard documents</a>; some testing-related options are covered in <a href="./testing.html">Testing</a>.</p>
<div>
<div>
<pre class="source">&lt;property&gt;
&lt;name&gt;fs.s3a.connection.maximum&lt;/name&gt;
&lt;value&gt;15&lt;/value&gt;
&lt;description&gt;Controls the maximum number of simultaneous connections to S3.&lt;/description&gt;
&lt;/property&gt;
&lt;property&gt;
&lt;name&gt;fs.s3a.connection.ssl.enabled&lt;/name&gt;
&lt;value&gt;true&lt;/value&gt;
&lt;description&gt;Enables or disables SSL connections to S3.&lt;/description&gt;
&lt;/property&gt;
&lt;property&gt;
&lt;name&gt;fs.s3a.endpoint&lt;/name&gt;
&lt;description&gt;AWS S3 endpoint to connect to. An up-to-date list is
provided in the AWS Documentation: regions and endpoints. Without this
property, the standard region (s3.amazonaws.com) is assumed.
&lt;/description&gt;
&lt;/property&gt;
&lt;property&gt;
&lt;name&gt;fs.s3a.path.style.access&lt;/name&gt;
&lt;value&gt;false&lt;/value&gt;
&lt;description&gt;Enable S3 path style access ie disabling the default virtual hosting behaviour.
Useful for S3A-compliant storage providers as it removes the need to set up DNS for virtual hosting.
&lt;/description&gt;
&lt;/property&gt;
&lt;property&gt;
&lt;name&gt;fs.s3a.proxy.host&lt;/name&gt;
&lt;description&gt;Hostname of the (optional) proxy server for S3 connections.&lt;/description&gt;
&lt;/property&gt;
&lt;property&gt;
&lt;name&gt;fs.s3a.proxy.port&lt;/name&gt;
&lt;description&gt;Proxy server port. If this property is not set
but fs.s3a.proxy.host is, port 80 or 443 is assumed (consistent with
the value of fs.s3a.connection.ssl.enabled).&lt;/description&gt;
&lt;/property&gt;
&lt;property&gt;
&lt;name&gt;fs.s3a.proxy.username&lt;/name&gt;
&lt;description&gt;Username for authenticating with proxy server.&lt;/description&gt;
&lt;/property&gt;
&lt;property&gt;
&lt;name&gt;fs.s3a.proxy.password&lt;/name&gt;
&lt;description&gt;Password for authenticating with proxy server.&lt;/description&gt;
&lt;/property&gt;
&lt;property&gt;
&lt;name&gt;fs.s3a.proxy.domain&lt;/name&gt;
&lt;description&gt;Domain for authenticating with proxy server.&lt;/description&gt;
&lt;/property&gt;
&lt;property&gt;
&lt;name&gt;fs.s3a.proxy.workstation&lt;/name&gt;
&lt;description&gt;Workstation for authenticating with proxy server.&lt;/description&gt;
&lt;/property&gt;
&lt;property&gt;
&lt;name&gt;fs.s3a.attempts.maximum&lt;/name&gt;
&lt;value&gt;20&lt;/value&gt;
&lt;description&gt;How many times we should retry commands on transient errors.&lt;/description&gt;
&lt;/property&gt;
&lt;property&gt;
&lt;name&gt;fs.s3a.connection.establish.timeout&lt;/name&gt;
&lt;value&gt;5000&lt;/value&gt;
&lt;description&gt;Socket connection setup timeout in milliseconds.&lt;/description&gt;
&lt;/property&gt;
&lt;property&gt;
&lt;name&gt;fs.s3a.connection.timeout&lt;/name&gt;
&lt;value&gt;200000&lt;/value&gt;
&lt;description&gt;Socket connection timeout in milliseconds.&lt;/description&gt;
&lt;/property&gt;
&lt;property&gt;
&lt;name&gt;fs.s3a.paging.maximum&lt;/name&gt;
&lt;value&gt;5000&lt;/value&gt;
&lt;description&gt;How many keys to request from S3 when doing
directory listings at a time.&lt;/description&gt;
&lt;/property&gt;
&lt;property&gt;
&lt;name&gt;fs.s3a.threads.max&lt;/name&gt;
&lt;value&gt;10&lt;/value&gt;
&lt;description&gt; Maximum number of concurrent active (part)uploads,
which each use a thread from the threadpool.&lt;/description&gt;
&lt;/property&gt;
&lt;property&gt;
&lt;name&gt;fs.s3a.socket.send.buffer&lt;/name&gt;
&lt;value&gt;8192&lt;/value&gt;
&lt;description&gt;Socket send buffer hint to amazon connector. Represented in bytes.&lt;/description&gt;
&lt;/property&gt;
&lt;property&gt;
&lt;name&gt;fs.s3a.socket.recv.buffer&lt;/name&gt;
&lt;value&gt;8192&lt;/value&gt;
&lt;description&gt;Socket receive buffer hint to amazon connector. Represented in bytes.&lt;/description&gt;
&lt;/property&gt;
&lt;property&gt;
&lt;name&gt;fs.s3a.threads.keepalivetime&lt;/name&gt;
&lt;value&gt;60&lt;/value&gt;
&lt;description&gt;Number of seconds a thread can be idle before being
terminated.&lt;/description&gt;
&lt;/property&gt;
&lt;property&gt;
&lt;name&gt;fs.s3a.max.total.tasks&lt;/name&gt;
&lt;value&gt;5&lt;/value&gt;
&lt;description&gt;Number of (part)uploads allowed to the queue before
blocking additional uploads.&lt;/description&gt;
&lt;/property&gt;
&lt;property&gt;
&lt;name&gt;fs.s3a.multipart.size&lt;/name&gt;
&lt;value&gt;64M&lt;/value&gt;
&lt;description&gt;How big (in bytes) to split upload or copy operations up into.
A suffix from the set {K,M,G,T,P} may be used to scale the numeric value.
&lt;/description&gt;
&lt;/property&gt;
&lt;property&gt;
&lt;name&gt;fs.s3a.multipart.threshold&lt;/name&gt;
&lt;value&gt;128MB&lt;/value&gt;
&lt;description&gt;How big (in bytes) to split upload or copy operations up into.
This also controls the partition size in renamed files, as rename() involves
copying the source file(s).
A suffix from the set {K,M,G,T,P} may be used to scale the numeric value.
&lt;/description&gt;
&lt;/property&gt;
&lt;property&gt;
&lt;name&gt;fs.s3a.multiobjectdelete.enable&lt;/name&gt;
&lt;value&gt;true&lt;/value&gt;
&lt;description&gt;When enabled, multiple single-object delete requests are replaced by
a single 'delete multiple objects'-request, reducing the number of requests.
Beware: legacy S3-compatible object stores might not support this request.
&lt;/description&gt;
&lt;/property&gt;
&lt;property&gt;
&lt;name&gt;fs.s3a.acl.default&lt;/name&gt;
&lt;description&gt;Set a canned ACL for newly created and copied objects. Value may be Private,
PublicRead, PublicReadWrite, AuthenticatedRead, LogDeliveryWrite, BucketOwnerRead,
or BucketOwnerFullControl.&lt;/description&gt;
&lt;/property&gt;
&lt;property&gt;
&lt;name&gt;fs.s3a.multipart.purge&lt;/name&gt;
&lt;value&gt;false&lt;/value&gt;
&lt;description&gt;True if you want to purge existing multipart uploads that may not have been
completed/aborted correctly&lt;/description&gt;
&lt;/property&gt;
&lt;property&gt;
&lt;name&gt;fs.s3a.multipart.purge.age&lt;/name&gt;
&lt;value&gt;86400&lt;/value&gt;
&lt;description&gt;Minimum age in seconds of multipart uploads to purge&lt;/description&gt;
&lt;/property&gt;
&lt;property&gt;
&lt;name&gt;fs.s3a.signing-algorithm&lt;/name&gt;
&lt;description&gt;Override the default signing algorithm so legacy
implementations can still be used&lt;/description&gt;
&lt;/property&gt;
&lt;property&gt;
&lt;name&gt;fs.s3a.server-side-encryption-algorithm&lt;/name&gt;
&lt;description&gt;Specify a server-side encryption algorithm for s3a: file system.
Unset by default. It supports the following values: 'AES256' (for SSE-S3), 'SSE-KMS'
and 'SSE-C'
&lt;/description&gt;
&lt;/property&gt;
&lt;property&gt;
&lt;name&gt;fs.s3a.server-side-encryption.key&lt;/name&gt;
&lt;description&gt;Specific encryption key to use if fs.s3a.server-side-encryption-algorithm
has been set to 'SSE-KMS' or 'SSE-C'. In the case of SSE-C, the value of this property
should be the Base64 encoded key. If you are using SSE-KMS and leave this property empty,
you'll be using your default's S3 KMS key, otherwise you should set this property to
the specific KMS key id.&lt;/description&gt;
&lt;/property&gt;
&lt;property&gt;
&lt;name&gt;fs.s3a.buffer.dir&lt;/name&gt;
&lt;value&gt;${hadoop.tmp.dir}/s3a&lt;/value&gt;
&lt;description&gt;Comma separated list of directories that will be used to buffer file
uploads to.&lt;/description&gt;
&lt;/property&gt;
&lt;property&gt;
&lt;name&gt;fs.s3a.block.size&lt;/name&gt;
&lt;value&gt;32M&lt;/value&gt;
&lt;description&gt;Block size to use when reading files using s3a: file system.
&lt;/description&gt;
&lt;/property&gt;
&lt;property&gt;
&lt;name&gt;fs.s3a.user.agent.prefix&lt;/name&gt;
&lt;value&gt;&lt;/value&gt;
&lt;description&gt;
Sets a custom value that will be prepended to the User-Agent header sent in
HTTP requests to the S3 back-end by S3AFileSystem. The User-Agent header
always includes the Hadoop version number followed by a string generated by
the AWS SDK. An example is &quot;User-Agent: Hadoop 2.8.0, aws-sdk-java/1.10.6&quot;.
If this optional property is set, then its value is prepended to create a
customized User-Agent. For example, if this configuration property was set
to &quot;MyApp&quot;, then an example of the resulting User-Agent would be
&quot;User-Agent: MyApp, Hadoop 2.8.0, aws-sdk-java/1.10.6&quot;.
&lt;/description&gt;
&lt;/property&gt;
&lt;property&gt;
&lt;name&gt;fs.s3a.impl&lt;/name&gt;
&lt;value&gt;org.apache.hadoop.fs.s3a.S3AFileSystem&lt;/value&gt;
&lt;description&gt;The implementation class of the S3A Filesystem&lt;/description&gt;
&lt;/property&gt;
&lt;property&gt;
&lt;name&gt;fs.AbstractFileSystem.s3a.impl&lt;/name&gt;
&lt;value&gt;org.apache.hadoop.fs.s3a.S3A&lt;/value&gt;
&lt;description&gt;The implementation class of the S3A AbstractFileSystem.&lt;/description&gt;
&lt;/property&gt;
&lt;property&gt;
&lt;name&gt;fs.s3a.readahead.range&lt;/name&gt;
&lt;value&gt;64K&lt;/value&gt;
&lt;description&gt;Bytes to read ahead during a seek() before closing and
re-opening the S3 HTTP connection. This option will be overridden if
any call to setReadahead() is made to an open stream.&lt;/description&gt;
&lt;/property&gt;
&lt;property&gt;
&lt;name&gt;fs.s3a.list.version&lt;/name&gt;
&lt;value&gt;2&lt;/value&gt;
&lt;description&gt;Select which version of the S3 SDK's List Objects API to use.
Currently support 2 (default) and 1 (older API).&lt;/description&gt;
&lt;/property&gt;
&lt;property&gt;
&lt;name&gt;fs.s3a.connection.request.timeout&lt;/name&gt;
&lt;value&gt;0&lt;/value&gt;
&lt;description&gt;
Time out on HTTP requests to the AWS service; 0 means no timeout.
Measured in seconds; the usual time suffixes are all supported
Important: this is the maximum duration of any AWS service call,
including upload and copy operations. If non-zero, it must be larger
than the time to upload multi-megabyte blocks to S3 from the client,
and to rename many-GB files. Use with care.
Values that are larger than Integer.MAX_VALUE milliseconds are
converged to Integer.MAX_VALUE milliseconds
&lt;/description&gt;
&lt;/property&gt;
&lt;property&gt;
&lt;name&gt;fs.s3a.bucket.probe&lt;/name&gt;
&lt;value&gt;0&lt;/value&gt;
&lt;description&gt;
The value can be 0 (default), 1 or 2.
When set to 0, bucket existence checks won't be done
during initialization thus making it faster.
Though it should be noted that when the bucket is not available in S3,
or if fs.s3a.endpoint points to the wrong instance of a private S3 store
consecutive calls like listing, read, write etc. will fail with
an UnknownStoreException.
When set to 1, the bucket existence check will be done using the
V1 API of the S3 protocol which doesn't verify the client's permissions
to list or read data in the bucket.
When set to 2, the bucket existence check will be done using the
V2 API of the S3 protocol which does verify that the
client has permission to read the bucket.
&lt;/description&gt;
&lt;/property&gt;
</pre></div></div>
</div>
<div class="section">
<h2><a name="Retry_and_Recovery"></a><a name="retry_and_recovery"></a>Retry and Recovery</h2>
<p>The S3A client makes a best-effort attempt at recovering from network failures; this section covers the details of what it does.</p>
<p>The S3A client divides exceptions returned by the AWS SDK into different categories, and chooses a different retry policy based on their type and whether or not the failing operation is idempotent.</p>
<div class="section">
<h3><a name="Unrecoverable_Problems:_Fail_Fast"></a>Unrecoverable Problems: Fail Fast</h3>
<ul>
<li>No object/bucket store: <tt>FileNotFoundException</tt></li>
<li>No access permissions: <tt>AccessDeniedException</tt></li>
<li>Network errors considered unrecoverable (<tt>UnknownHostException</tt>, <tt>NoRouteToHostException</tt>, <tt>AWSRedirectException</tt>).</li>
<li>Interruptions: <tt>InterruptedIOException</tt>, <tt>InterruptedException</tt>.</li>
<li>Rejected HTTP requests: <tt>InvalidRequestException</tt></li>
</ul>
<p>These are all considered unrecoverable: S3A will make no attempt to recover from them.</p></div>
<div class="section">
<h3><a name="Possibly_Recoverable_Problems:_Retry"></a>Possibly Recoverable Problems: Retry</h3>
<ul>
<li>Connection timeout: <tt>ConnectTimeoutException</tt>. Timeout before setting up a connection to the S3 endpoint (or proxy).</li>
<li>HTTP response status code 400, &#x201c;Bad Request&#x201d;</li>
</ul>
<p>The status code 400, Bad Request usually means that the request is unrecoverable; it&#x2019;s the generic &#x201c;No&#x201d; response. Very rarely it does recover, which is why it is in this category, rather than that of unrecoverable failures.</p>
<p>These failures will be retried with an exponential sleep interval set in <tt>fs.s3a.retry.interval</tt>, up to the limit set in <tt>fs.s3a.retry.limit</tt>.</p></div>
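<p>A minimal configuration sketch for these two options is shown below; the values here are illustrative rather than authoritative defaults, which are listed in <tt>core-default.xml</tt> for each release.</p>
<div>
<div>
<pre class="source">&lt;property&gt;
&lt;name&gt;fs.s3a.retry.limit&lt;/name&gt;
&lt;value&gt;7&lt;/value&gt;
&lt;description&gt;Number of times to retry a retriable request.&lt;/description&gt;
&lt;/property&gt;
&lt;property&gt;
&lt;name&gt;fs.s3a.retry.interval&lt;/name&gt;
&lt;value&gt;500ms&lt;/value&gt;
&lt;description&gt;Initial interval between retry attempts.&lt;/description&gt;
&lt;/property&gt;
</pre></div></div>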
<div class="section">
<h3><a name="Only_retriable_on_idempotent_operations"></a>Only retriable on idempotent operations</h3>
<p>Some network failures are considered to be retriable if they occur on idempotent operations; there&#x2019;s no way to know if they happened after the request was processed by S3.</p>
<ul>
<li><tt>SocketTimeoutException</tt>: general network failure.</li>
<li><tt>EOFException</tt> : the connection was broken while reading data</li>
<li>&#x201c;No response from Server&#x201d; (443, 444) HTTP responses.</li>
<li>Any other AWS client, service or S3 exception.</li>
</ul>
<p>These failures will be retried with an exponential sleep interval set in <tt>fs.s3a.retry.interval</tt>, up to the limit set in <tt>fs.s3a.retry.limit</tt>.</p>
<p><i>Important</i>: DELETE is considered idempotent, hence: <tt>FileSystem.delete()</tt> and <tt>FileSystem.rename()</tt> will retry their delete requests on any of these failures.</p>
<p>The issue of whether delete should be idempotent has been a source of historical controversy in Hadoop.</p>
<ol style="list-style-type: decimal">
<li>In the absence of any other changes to the object store, a repeated DELETE request will eventually result in the named object being deleted; it&#x2019;s a no-op if reprocessed, as, indeed, is <tt>FileSystem.delete()</tt>.</li>
<li>If another client creates a file under the path, it will be deleted.</li>
<li>Any filesystem supporting an atomic <tt>FileSystem.create(path, overwrite=false)</tt> operation to reject file creation if the path exists MUST NOT consider delete to be idempotent, because a <tt>create(path, false)</tt> operation will only succeed if the first <tt>delete()</tt> call has already succeeded.</li>
<li>And a second, retried <tt>delete()</tt> call could delete the new data.</li>
</ol>
<p>Because S3 is eventually consistent <i>and</i> doesn&#x2019;t support an atomic create-no-overwrite operation, the choice is more ambiguous.</p>
<p>Currently S3A considers delete to be idempotent because it is convenient for many workflows, including the commit protocols. Just be aware that in the presence of transient failures, more things may be deleted than expected. (For anyone who considers this to be the wrong decision: rebuild the <tt>hadoop-aws</tt> module with the constant <tt>S3AFileSystem.DELETE_CONSIDERED_IDEMPOTENT</tt> set to <tt>false</tt>).</p></div>
<div class="section">
<h3><a name="Throttled_requests_from_S3_and_Dynamo_DB"></a>Throttled requests from S3 and Dynamo DB</h3>
<p>When S3A or Dynamo DB returns a response indicating that requests from the caller are being throttled, an exponential back-off is started, with an initial interval and a maximum number of retries.</p>
<div>
<div>
<pre class="source">&lt;property&gt;
&lt;name&gt;fs.s3a.retry.throttle.limit&lt;/name&gt;
&lt;value&gt;${fs.s3a.attempts.maximum}&lt;/value&gt;
&lt;description&gt;
Number of times to retry any throttled request.
&lt;/description&gt;
&lt;/property&gt;
&lt;property&gt;
&lt;name&gt;fs.s3a.retry.throttle.interval&lt;/name&gt;
&lt;value&gt;1000ms&lt;/value&gt;
&lt;description&gt;
Interval between retry attempts on throttled requests.
&lt;/description&gt;
&lt;/property&gt;
</pre></div></div>
<p>Notes</p>
<ol style="list-style-type: decimal">
<li>There is also throttling taking place inside the AWS SDK; this is managed by the value <tt>fs.s3a.attempts.maximum</tt>.</li>
<li>Throttling events are tracked in the S3A filesystem metrics and statistics.</li>
<li>Amazon KMS may throttle a customer based on the total rate of uses of KMS <i>across all user accounts and applications</i>.</li>
</ol>
<p>Throttling of S3 requests is all too common; it is caused by too many clients trying to access the same shard of S3 storage. This generally happens if there are too many reads, those being the most common in Hadoop applications. This problem is exacerbated by Hive&#x2019;s partitioning strategy used when storing data, such as partitioning by year and then month. This results in paths with little or no variation at their start, so all the data ends up being stored in the same shard(s).</p>
<p>Here are some expensive operations; the more of these taking place against part of an S3 bucket, the more load it experiences.</p>
<ul>
<li>Many clients trying to list directories or calling <tt>getFileStatus</tt> on paths (LIST and HEAD requests respectively).</li>
<li>The GET requests issued when reading data.</li>
<li>Random IO used when reading columnar data (ORC, Parquet), which issues many more GET requests than a simple one-per-file read.</li>
<li>The number of active writes to that part of the S3 bucket.</li>
</ul>
<p>A special case is when enough data has been written into part of an S3 bucket that S3 decides to split the data across more than one shard: this is believed to be done by a copy operation which can take some time. While this is under way, S3 clients accessing data under these paths will be throttled more than usual.</p>
<p>Mitigation strategies</p>
<ol style="list-style-type: decimal">
<li>Use separate buckets for intermediate data/different applications/roles.</li>
<li>Use significantly different paths for different datasets in the same bucket.</li>
<li>Increase the value of <tt>fs.s3a.retry.throttle.interval</tt> to provide longer delays between attempts.</li>
<li>Reduce the parallelism of the queries. The more tasks trying to access data in parallel, the more load.</li>
<li>Reduce <tt>fs.s3a.threads.max</tt> to reduce the number of parallel operations performed by clients.</li>
<li>Maybe: increase <tt>fs.s3a.readahead.range</tt> to increase the minimum amount of data asked for in every GET request, as well as how much data is skipped in the existing stream before aborting it and creating a new stream.</li>
<li>If the DynamoDB tables used by S3Guard are being throttled, increase the capacity through <tt>hadoop s3guard set-capacity</tt> (and pay more, obviously).</li>
<li>KMS: &#x201c;consult AWS about increasing your capacity&#x201d;.</li>
</ol></div></div>
<div class="section">
<h2><a name="Handling_Read-During-Overwrite"></a>Handling Read-During-Overwrite</h2>
<p>Read-during-overwrite is the condition where a writer overwrites a file while a reader has an open input stream on the file. Depending on configuration, the S3AFileSystem may detect this and throw a <tt>RemoteFileChangedException</tt> in conditions where the reader&#x2019;s input stream might otherwise silently switch over from reading bytes from the original version of the file to reading bytes from the new version.</p>
<p>The configuration items controlling this behavior are:</p>
<div>
<div>
<pre class="source">&lt;property&gt;
&lt;name&gt;fs.s3a.change.detection.source&lt;/name&gt;
&lt;value&gt;etag&lt;/value&gt;
&lt;description&gt;
Select which S3 object attribute to use for change detection.
Currently support 'etag' for S3 object eTags and 'versionid' for
S3 object version IDs. Use of version IDs requires object versioning to be
enabled for each S3 bucket utilized. Object versioning is disabled on
buckets by default. When version ID is used, the buckets utilized should
have versioning enabled before any data is written.
&lt;/description&gt;
&lt;/property&gt;
&lt;property&gt;
&lt;name&gt;fs.s3a.change.detection.mode&lt;/name&gt;
&lt;value&gt;server&lt;/value&gt;
&lt;description&gt;
Determines how change detection is applied to alert to S3 objects
rewritten while being read. Value 'server' indicates to apply the attribute
constraint directly on GetObject requests to S3. Value 'client' means to do a
client-side comparison of the attribute value returned in the response. Value
'server' would not work with third-party S3 implementations that do not
support these constraints on GetObject. Values 'server' and 'client' generate
RemoteObjectChangedException when a mismatch is detected. Value 'warn' works
like 'client' but generates only a warning. Value 'none' will ignore change
detection completely.
&lt;/description&gt;
&lt;/property&gt;
&lt;property&gt;
&lt;name&gt;fs.s3a.change.detection.version.required&lt;/name&gt;
&lt;value&gt;true&lt;/value&gt;
&lt;description&gt;
Determines if S3 object version attribute defined by
fs.s3.change.detection.source should be treated as required. If true and the
referred attribute is unavailable in an S3 GetObject response,
NoVersionAttributeException is thrown. Setting to 'true' is encouraged to
avoid potential for inconsistent reads with third-party S3 implementations or
against S3 buckets that have object versioning disabled.
&lt;/description&gt;
&lt;/property&gt;
</pre></div></div>
<p>In the default configuration, S3 object eTags are used to detect changes. When the filesystem retrieves a file from S3 using <a class="externalLink" href="https://docs.aws.amazon.com/AmazonS3/latest/API/RESTObjectGET.html">Get Object</a>, it captures the eTag and uses that eTag in an <tt>If-Match</tt> condition on each subsequent request. If a concurrent writer has overwritten the file, the &#x2018;If-Match&#x2019; condition will fail and a <tt>RemoteFileChangedException</tt> will be thrown.</p>
<p>Even in this default configuration, a new write may not trigger this exception on an open reader. For example, if the reader only reads forward in the file then only a single S3 &#x2018;Get Object&#x2019; request is made and the full contents of the file are streamed from a single response. An overwrite of the file after the &#x2018;Get Object&#x2019; request would not be seen at all by a reader with an input stream that had already read the first byte. Seeks backward on the other hand can result in new &#x2018;Get Object&#x2019; requests that can trigger the <tt>RemoteFileChangedException</tt>.</p>
<p>Additionally, due to the eventual consistency of S3 in a read-after-overwrite scenario, visibility of a new write may be delayed, avoiding the <tt>RemoteFileChangedException</tt> for some readers. That said, if a reader does not see <tt>RemoteFileChangedException</tt>, they will have at least read a consistent view of a single version of the file (the version available when they started reading).</p>
<div class="section">
<h3><a name="Change_detection_with_S3_Versions."></a>Change detection with S3 Versions.</h3>
<p>It is possible to switch to using the <a class="externalLink" href="https://docs.aws.amazon.com/AmazonS3/latest/dev/ObjectVersioning.html">S3 object version id</a> instead of eTag as the change detection mechanism. Use of this option requires object versioning to be enabled on any S3 buckets used by the filesystem. The benefit of using version id instead of eTag is potentially reduced frequency of <tt>RemoteFileChangedException</tt>. With object versioning enabled, old versions of objects remain available after they have been overwritten. This means an open input stream will still be able to seek backwards after a concurrent writer has overwritten the file. The reader will retain their consistent view of the version of the file from which they read the first byte. Because the version ID is null for objects written prior to enablement of object versioning, <b>this option should only be used when the S3 buckets have object versioning enabled from the beginning.</b></p>
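<p>A minimal sketch of switching to version-ID-based change detection; as noted above, this is only appropriate when the buckets in use have had versioning enabled from the beginning:</p>
<div>
<div>
<pre class="source">&lt;property&gt;
&lt;name&gt;fs.s3a.change.detection.source&lt;/name&gt;
&lt;value&gt;versionid&lt;/value&gt;
&lt;/property&gt;
</pre></div></div>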
<p>Note: when you rename files the copied files may have a different version number.</p></div>
<div class="section">
<h3><a name="Change_Detection_Modes."></a>Change Detection Modes.</h3>
<p>Configurable change detection mode is the next option. Different modes are available primarily for compatibility with third-party S3 implementations which may not support all change detection mechanisms.</p>
<ul>
<li><tt>server</tt>: the version/etag check is performed on the server by adding extra headers to the <tt>GET</tt> request. This is the default.</li>
<li><tt>client</tt> : check on the client by comparing the eTag/version ID of a reopened file with the previous version. This is useful when the implementation doesn&#x2019;t support the <tt>If-Match</tt> header (see the example after this list).</li>
<li><tt>warn</tt>: check on the client, but only warn on a mismatch, rather than fail.</li>
<li><tt>none</tt> do not check. Useful if the implementation doesn&#x2019;t provide eTag or version ID support at all or you would like to retain previous behavior where the reader&#x2019;s input stream silently switches over to the new object version (not recommended).</li>
</ul>
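<p>For example, a deployment working with a store which does not support the <tt>If-Match</tt> header could switch to client-side checking:</p>
<div>
<div>
<pre class="source">&lt;property&gt;
&lt;name&gt;fs.s3a.change.detection.mode&lt;/name&gt;
&lt;value&gt;client&lt;/value&gt;
&lt;/property&gt;
</pre></div></div>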
<p>The final option (<tt>fs.s3a.change.detection.version.required</tt>) is present primarily to ensure the filesystem doesn&#x2019;t silently ignore the condition where it is configured to use version ID on a bucket that doesn&#x2019;t have object versioning enabled or alternatively it is configured to use eTag on an S3 implementation that doesn&#x2019;t return eTags.</p>
<p>When <tt>true</tt> (default) and &#x2018;Get Object&#x2019; doesn&#x2019;t return eTag or version ID (depending on configured &#x2018;source&#x2019;), a <tt>NoVersionAttributeException</tt> will be thrown. When <tt>false</tt> and and eTag or version ID is not returned, the stream can be read, but without any version checking.</p></div></div>
<div class="section">
<h2><a name="Configuring_different_S3_buckets_with_Per-Bucket_Configuration"></a><a name="per_bucket_configuration"></a>Configuring different S3 buckets with Per-Bucket Configuration</h2>
<p>Different S3 buckets can be accessed with different S3A client configurations. This allows for different endpoints, data read and write strategies, as well as login details.</p>
<ol style="list-style-type: decimal">
<li>All <tt>fs.s3a</tt> options other than a small set of unmodifiable values (currently <tt>fs.s3a.impl</tt>) can be set on a per bucket basis.</li>
<li>The bucket specific option is set by replacing the <tt>fs.s3a.</tt> prefix on an option with <tt>fs.s3a.bucket.BUCKETNAME.</tt>, where <tt>BUCKETNAME</tt> is the name of the bucket.</li>
<li>When connecting to a bucket, all options explicitly set will override the base <tt>fs.s3a.</tt> values.</li>
</ol>
<p>As an example, a configuration could have a base configuration to use the IAM role information available when deployed in Amazon EC2.</p>
<div>
<div>
<pre class="source">&lt;property&gt;
&lt;name&gt;fs.s3a.aws.credentials.provider&lt;/name&gt;
&lt;value&gt;com.amazonaws.auth.InstanceProfileCredentialsProvider&lt;/value&gt;
&lt;/property&gt;
</pre></div></div>
<p>This will become the default authentication mechanism for S3A buckets.</p>
<p>A bucket <tt>s3a://nightly/</tt> used for nightly data can then be given a session key:</p>
<div>
<div>
<pre class="source">&lt;property&gt;
&lt;name&gt;fs.s3a.bucket.nightly.access.key&lt;/name&gt;
&lt;value&gt;AKAACCESSKEY-2&lt;/value&gt;
&lt;/property&gt;
&lt;property&gt;
&lt;name&gt;fs.s3a.bucket.nightly.secret.key&lt;/name&gt;
&lt;value&gt;SESSIONSECRETKEY&lt;/value&gt;
&lt;/property&gt;
&lt;property&gt;
&lt;name&gt;fs.s3a.bucket.nightly.session.token&lt;/name&gt;
&lt;value&gt;Short-lived-session-token&lt;/value&gt;
&lt;/property&gt;
&lt;property&gt;
&lt;name&gt;fs.s3a.bucket.nightly.aws.credentials.provider&lt;/name&gt;
&lt;value&gt;org.apache.hadoop.fs.s3a.TemporaryAWSCredentialsProvider&lt;/value&gt;
&lt;/property&gt;
</pre></div></div>
<p>Finally, the public <tt>s3a://landsat-pds/</tt> bucket can be accessed anonymously:</p>
<div>
<div>
<pre class="source">&lt;property&gt;
&lt;name&gt;fs.s3a.bucket.landsat-pds.aws.credentials.provider&lt;/name&gt;
&lt;value&gt;org.apache.hadoop.fs.s3a.AnonymousAWSCredentialsProvider&lt;/value&gt;
&lt;/property&gt;
</pre></div></div>
<div class="section">
<h3><a name="Customizing_S3A_secrets_held_in_credential_files"></a>Customizing S3A secrets held in credential files</h3>
<p>Secrets in JCEKS files or provided by other Hadoop credential providers can also be configured on a per bucket basis. The S3A client will look for the per-bucket secrets before falling back to the bucket-independent values.</p>
<p>Consider a JCEKS file with the following keys:</p>
<div>
<div>
<pre class="source">fs.s3a.access.key
fs.s3a.secret.key
fs.s3a.server-side-encryption-algorithm
fs.s3a.bucket.nightly.access.key
fs.s3a.bucket.nightly.secret.key
fs.s3a.bucket.nightly.session.token
fs.s3a.bucket.nightly.server-side-encryption.key
fs.s3a.bucket.nightly.server-side-encryption-algorithm
</pre></div></div>
<p>When accessing the bucket <tt>s3a://nightly/</tt>, the per-bucket configuration options for that bucket will be used, here the access keys and token, and including the encryption algorithm and key.</p></div>
<div class="section">
<h3><a name="Using_Per-Bucket_Configuration_to_access_data_round_the_world"></a><a name="per_bucket_endpoints"></a>Using Per-Bucket Configuration to access data round the world</h3>
<p>S3 Buckets are hosted in different &#x201c;regions&#x201d;, the default being &#x201c;US-East&#x201d;. The S3A client talks to this region by default, issuing HTTP requests to the server <tt>s3.amazonaws.com</tt>.</p>
<p>S3A can work with buckets from any region. Each region has its own S3 endpoint, documented <a class="externalLink" href="http://docs.aws.amazon.com/general/latest/gr/rande.html#s3_region">by Amazon</a>.</p>
<ol style="list-style-type: decimal">
<li>Applications running in EC2 infrastructure do not pay for IO to/from <i>local S3 buckets</i>. They will be billed for access to remote buckets. Always use local buckets and local copies of data, wherever possible.</li>
<li>The default S3 endpoint can support data IO with any bucket when the V1 request signing protocol is used.</li>
<li>When the V4 signing protocol is used, AWS requires the explicit region endpoint to be used &#x2014;hence S3A must be configured to use the specific endpoint. This is done in the configuration option <tt>fs.s3a.endpoint</tt>.</li>
<li>All endpoints other than the default endpoint only support interaction with buckets local to that S3 instance.</li>
</ol>
<p>While it is generally simpler to use the default endpoint, working with V4-signing-only regions (Frankfurt, Seoul) requires the endpoint to be identified. Expect better performance from direct connections &#x2014;traceroute will give you some insight.</p>
<p>If the wrong endpoint is used, the request may fail. This may be reported as a 301/redirect error, or as a 400 Bad Request: take these as cues to check the endpoint setting of a bucket.</p>
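<p>For example, a client working with a bucket hosted in the Frankfurt region could set the endpoint directly, taking the value from the list below:</p>
<div>
<div>
<pre class="source">&lt;property&gt;
&lt;name&gt;fs.s3a.endpoint&lt;/name&gt;
&lt;value&gt;s3.eu-central-1.amazonaws.com&lt;/value&gt;
&lt;/property&gt;
</pre></div></div>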
<p>Here is a list of properties defining all AWS S3 regions, current as of June 2017:</p>
<div>
<div>
<pre class="source">&lt;!--
This is the default endpoint, which can be used to interact
with any v2 region.
--&gt;
&lt;property&gt;
&lt;name&gt;central.endpoint&lt;/name&gt;
&lt;value&gt;s3.amazonaws.com&lt;/value&gt;
&lt;/property&gt;
&lt;property&gt;
&lt;name&gt;canada.endpoint&lt;/name&gt;
&lt;value&gt;s3.ca-central-1.amazonaws.com&lt;/value&gt;
&lt;/property&gt;
&lt;property&gt;
&lt;name&gt;frankfurt.endpoint&lt;/name&gt;
&lt;value&gt;s3.eu-central-1.amazonaws.com&lt;/value&gt;
&lt;/property&gt;
&lt;property&gt;
&lt;name&gt;ireland.endpoint&lt;/name&gt;
&lt;value&gt;s3-eu-west-1.amazonaws.com&lt;/value&gt;
&lt;/property&gt;
&lt;property&gt;
&lt;name&gt;london.endpoint&lt;/name&gt;
&lt;value&gt;s3.eu-west-2.amazonaws.com&lt;/value&gt;
&lt;/property&gt;
&lt;property&gt;
&lt;name&gt;mumbai.endpoint&lt;/name&gt;
&lt;value&gt;s3.ap-south-1.amazonaws.com&lt;/value&gt;
&lt;/property&gt;
&lt;property&gt;
&lt;name&gt;ohio.endpoint&lt;/name&gt;
&lt;value&gt;s3.us-east-2.amazonaws.com&lt;/value&gt;
&lt;/property&gt;
&lt;property&gt;
&lt;name&gt;oregon.endpoint&lt;/name&gt;
&lt;value&gt;s3-us-west-2.amazonaws.com&lt;/value&gt;
&lt;/property&gt;
&lt;property&gt;
&lt;name&gt;sao-paolo.endpoint&lt;/name&gt;
&lt;value&gt;s3-sa-east-1.amazonaws.com&lt;/value&gt;
&lt;/property&gt;
&lt;property&gt;
&lt;name&gt;seoul.endpoint&lt;/name&gt;
&lt;value&gt;s3.ap-northeast-2.amazonaws.com&lt;/value&gt;
&lt;/property&gt;
&lt;property&gt;
&lt;name&gt;singapore.endpoint&lt;/name&gt;
&lt;value&gt;s3-ap-southeast-1.amazonaws.com&lt;/value&gt;
&lt;/property&gt;
&lt;property&gt;
&lt;name&gt;sydney.endpoint&lt;/name&gt;
&lt;value&gt;s3-ap-southeast-2.amazonaws.com&lt;/value&gt;
&lt;/property&gt;
&lt;property&gt;
&lt;name&gt;tokyo.endpoint&lt;/name&gt;
&lt;value&gt;s3-ap-northeast-1.amazonaws.com&lt;/value&gt;
&lt;/property&gt;
&lt;property&gt;
&lt;name&gt;virginia.endpoint&lt;/name&gt;
&lt;value&gt;${central.endpoint}&lt;/value&gt;
&lt;/property&gt;
</pre></div></div>
<p>This list can be used to specify the endpoint of individual buckets, for example for buckets in the central and EU/Ireland endpoints.</p>
<div>
<div>
<pre class="source">&lt;property&gt;
&lt;name&gt;fs.s3a.bucket.landsat-pds.endpoint&lt;/name&gt;
&lt;value&gt;${central.endpoint}&lt;/value&gt;
&lt;description&gt;The endpoint for s3a://landsat-pds URLs&lt;/description&gt;
&lt;/property&gt;
&lt;property&gt;
&lt;name&gt;fs.s3a.bucket.eu-dataset.endpoint&lt;/name&gt;
&lt;value&gt;${ireland.endpoint}&lt;/value&gt;
&lt;description&gt;The endpoint for s3a://eu-dataset URLs&lt;/description&gt;
&lt;/property&gt;
</pre></div></div>
<p>Why explicitly declare a bucket bound to the central endpoint? It ensures that if the default endpoint is changed to a new region, data stored in US-east is still reachable.</p></div></div>
<div class="section">
<h2><a name="How_S3A_writes_data_to_S3"></a><a name="upload"></a>How S3A writes data to S3</h2>
<p>The original S3A client implemented file writes by buffering all data to disk as it was written to the <tt>OutputStream</tt>. Only when the stream&#x2019;s <tt>close()</tt> method was called would the upload start.</p>
<p>This made output slow, especially on large uploads, and could even fill up the disk space of small (virtual) disks.</p>
<p>Hadoop 2.7 added the <tt>S3AFastOutputStream</tt> alternative, which Hadoop 2.8 expanded. It is now considered stable and has replaced the original <tt>S3AOutputStream</tt>, which is no longer shipped in Hadoop.</p>
<p>The &#x201c;fast&#x201d; output stream</p>
<ol style="list-style-type: decimal">
<li>Uploads large files as blocks with the size set by <tt>fs.s3a.multipart.size</tt>. That is: the threshold at which multipart uploads begin and the size of each upload are identical.</li>
<li>Buffers blocks to disk (default) or in on-heap or off-heap memory.</li>
<li>Uploads blocks in parallel in background threads.</li>
<li>Begins uploading blocks as soon as the buffered data exceeds this partition size.</li>
<li>When buffering data to disk, uses the directory/directories listed in <tt>fs.s3a.buffer.dir</tt>. The size of data which can be buffered is limited to the available disk space.</li>
<li>Generates output statistics as metrics on the filesystem, including statistics of active and pending block uploads.</li>
<li>Has the time to <tt>close()</tt> set by the amount of remaining data to upload, rather than the total size of the file.</li>
</ol>
<p>Because it starts uploading while data is still being written, it offers significant benefits when very large amounts of data are generated. The in memory buffering mechanisms may also offer speedup when running adjacent to S3 endpoints, as disks are not used for intermediate data storage.</p>
<div>
<div>
<pre class="source">&lt;property&gt;
&lt;name&gt;fs.s3a.fast.upload.buffer&lt;/name&gt;
&lt;value&gt;disk&lt;/value&gt;
&lt;description&gt;
The buffering mechanism to use.
Values: disk, array, bytebuffer.
&quot;disk&quot; will use the directories listed in fs.s3a.buffer.dir as
the location(s) to save data prior to being uploaded.
&quot;array&quot; uses arrays in the JVM heap
&quot;bytebuffer&quot; uses off-heap memory within the JVM.
Both &quot;array&quot; and &quot;bytebuffer&quot; will consume memory in a single stream up to the number
of blocks set by:
fs.s3a.multipart.size * fs.s3a.fast.upload.active.blocks.
If using either of these mechanisms, keep this value low
The total number of threads performing work across all threads is set by
fs.s3a.threads.max, with fs.s3a.max.total.tasks values setting the number of queued
work items.
&lt;/description&gt;
&lt;/property&gt;
&lt;property&gt;
&lt;name&gt;fs.s3a.multipart.size&lt;/name&gt;
&lt;value&gt;100M&lt;/value&gt;
&lt;description&gt;How big (in bytes) to split upload or copy operations up into.
A suffix from the set {K,M,G,T,P} may be used to scale the numeric value.
&lt;/description&gt;
&lt;/property&gt;
&lt;property&gt;
&lt;name&gt;fs.s3a.fast.upload.active.blocks&lt;/name&gt;
&lt;value&gt;8&lt;/value&gt;
&lt;description&gt;
Maximum Number of blocks a single output stream can have
active (uploading, or queued to the central FileSystem
instance's pool of queued operations.
This stops a single stream overloading the shared thread pool.
&lt;/description&gt;
&lt;/property&gt;
</pre></div></div>
<p><b>Notes</b></p>
<ul>
<li>
<p>If the amount of data written to a stream is below that set in <tt>fs.s3a.multipart.size</tt>, the upload is performed in the <tt>OutputStream.close()</tt> operation &#x2014;as with the original output stream.</p>
</li>
<li>
<p>The published Hadoop metrics include live queue length and upload operation counts, making it possible to identify when there is a backlog of work, or a mismatch between data generation rates and network bandwidth. Per-stream statistics can also be logged by calling <tt>toString()</tt> on the current stream.</p>
</li>
<li>
<p>Files being written are still invisible until the write completes in the <tt>close()</tt> call, which will block until the upload is completed.</p>
</li>
</ul>
<div class="section">
<h3><a name="Buffering_upload_data_on_disk_fs.s3a.fast.upload.buffer.3Ddisk"></a><a name="upload_disk"></a>Buffering upload data on disk <tt>fs.s3a.fast.upload.buffer=disk</tt></h3>
<p>When <tt>fs.s3a.fast.upload.buffer</tt> is set to <tt>disk</tt>, all data is buffered to local hard disks prior to upload. This minimizes the amount of memory consumed, and so eliminates heap size as the limiting factor in queued uploads &#x2014;exactly as the original &#x201c;direct to disk&#x201d; buffering.</p>
<div>
<div>
<pre class="source">&lt;property&gt;
&lt;name&gt;fs.s3a.fast.upload.buffer&lt;/name&gt;
&lt;value&gt;disk&lt;/value&gt;
&lt;/property&gt;
&lt;property&gt;
&lt;name&gt;fs.s3a.buffer.dir&lt;/name&gt;
&lt;value&gt;${hadoop.tmp.dir}/s3a&lt;/value&gt;
&lt;description&gt;Comma separated list of directories that will be used to buffer file
uploads to.&lt;/description&gt;
&lt;/property&gt;
</pre></div></div>
<p>This is the default buffer mechanism. The amount of data which can be buffered is limited by the amount of available disk space.</p></div>
<div class="section">
<h3><a name="Buffering_upload_data_in_ByteBuffers:_fs.s3a.fast.upload.buffer.3Dbytebuffer"></a><a name="upload_bytebuffer"></a>Buffering upload data in ByteBuffers: <tt>fs.s3a.fast.upload.buffer=bytebuffer</tt></h3>
<p>When <tt>fs.s3a.fast.upload.buffer</tt> is set to <tt>bytebuffer</tt>, all data is buffered in &#x201c;Direct&#x201d; ByteBuffers prior to upload. This <i>may</i> be faster than buffering to disk, and, if disk space is small (for example, tiny EC2 VMs), there may not be much disk space to buffer with.</p>
<p>The ByteBuffers are created in the memory of the JVM, but not in the Java Heap itself. The amount of data which can be buffered is limited by the Java runtime, the operating system, and, for YARN applications, the amount of memory requested for each container.</p>
<p>The slower the upload bandwidth to S3, the greater the risk of running out of memory &#x2014;and so the more care is needed in <a href="#upload_thread_tuning">tuning the upload settings</a>.</p>
<div>
<div>
<pre class="source">&lt;property&gt;
&lt;name&gt;fs.s3a.fast.upload.buffer&lt;/name&gt;
&lt;value&gt;bytebuffer&lt;/value&gt;
&lt;/property&gt;
</pre></div></div>
</div>
<div class="section">
<h3><a name="Buffering_upload_data_in_byte_arrays:_fs.s3a.fast.upload.buffer.3Darray"></a><a name="upload_array"></a>Buffering upload data in byte arrays: <tt>fs.s3a.fast.upload.buffer=array</tt></h3>
<p>When <tt>fs.s3a.fast.upload.buffer</tt> is set to <tt>array</tt>, all data is buffered in byte arrays in the JVM&#x2019;s heap prior to upload. This <i>may</i> be faster than buffering to disk.</p>
<p>The amount of data which can be buffered is limited by the available size of the JVM heap. The slower the write bandwidth to S3, the greater the risk of heap overflows. This risk can be mitigated by <a href="#upload_thread_tuning">tuning the upload settings</a>.</p>
<div>
<div>
<pre class="source">&lt;property&gt;
&lt;name&gt;fs.s3a.fast.upload.buffer&lt;/name&gt;
&lt;value&gt;array&lt;/value&gt;
&lt;/property&gt;
</pre></div></div>
</div>
<div class="section">
<h3><a name="Upload_Thread_Tuning"></a><a name="upload_thread_tuning"></a>Upload Thread Tuning</h3>
<p>Both the <a href="#upload_array">Array</a> and <a href="#upload_bytebuffer">Byte buffer</a> buffer mechanisms can consume very large amounts of memory, on-heap or off-heap respectively. The <a href="#upload_disk">disk buffer</a> mechanism does not use much memory, but will consume hard disk capacity.</p>
<p>If there are many output streams being written to in a single process, the amount of memory or disk used is the multiple of all stream&#x2019;s active memory/disk use.</p>
<p>Careful tuning may be needed to reduce the risk of running out of memory, especially if the data is buffered in memory.</p>
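<p>As a rough, illustrative worked example with memory buffering: with <tt>fs.s3a.multipart.size</tt> of 64M and <tt>fs.s3a.fast.upload.active.blocks</tt> of 8, a single stream may buffer up to 64M * 8 = 512M; four such streams writing simultaneously in one process could therefore consume about 2GB of memory.</p>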
<p>There are a number of parameters which can be tuned:</p>
<ol style="list-style-type: decimal">
<li>
<p>The total number of threads available in the filesystem for data uploads <i>or any other queued filesystem operation</i>. This is set in <tt>fs.s3a.threads.max</tt></p>
</li>
<li>
<p>The number of operations which can be queued for execution, <i>awaiting a thread</i>: <tt>fs.s3a.max.total.tasks</tt></p>
</li>
<li>
<p>The number of blocks which a single output stream can have active, that is: being uploaded by a thread, or queued in the filesystem thread queue: <tt>fs.s3a.fast.upload.active.blocks</tt></p>
</li>
<li>
<p>How long an idle thread can stay in the thread pool before it is retired: <tt>fs.s3a.threads.keepalivetime</tt></p>
</li>
</ol>
<p>When the maximum allowed number of active blocks of a single stream is reached, no more blocks can be uploaded from that stream until one or more of those active blocks&#x2019; uploads completes. That is: a <tt>write()</tt> call which would trigger an upload of a now full datablock, will instead block until there is capacity in the queue.</p>
<p>How does that come together?</p>
<ul>
<li>
<p>As the pool of threads set in <tt>fs.s3a.threads.max</tt> is shared (and intended to be used across all threads), a larger number here can allow for more parallel operations. However, as uploads require network bandwidth, adding more threads does not guarantee speedup.</p>
</li>
<li>
<p>The extra queue of tasks for the thread pool (<tt>fs.s3a.max.total.tasks</tt>) covers all ongoing background S3A operations (future plans include: parallelized rename operations, asynchronous directory operations).</p>
</li>
<li>
<p>When using memory buffering, a small value of <tt>fs.s3a.fast.upload.active.blocks</tt> limits the amount of memory which can be consumed per stream.</p>
</li>
<li>
<p>When using disk buffering a larger value of <tt>fs.s3a.fast.upload.active.blocks</tt> does not consume much memory. But it may result in a large number of blocks to compete with other filesystem operations.</p>
</li>
</ul>
<p>We recommend a low value of <tt>fs.s3a.fast.upload.active.blocks</tt>; enough to start background upload without overloading other parts of the system, then experiment to see if higher values deliver more throughput &#x2014;especially from VMs running on EC2.</p>
<div>
<div>
<pre class="source">&lt;property&gt;
&lt;name&gt;fs.s3a.fast.upload.active.blocks&lt;/name&gt;
&lt;value&gt;4&lt;/value&gt;
&lt;description&gt;
Maximum Number of blocks a single output stream can have
active (uploading, or queued to the central FileSystem
instance's pool of queued operations.
This stops a single stream overloading the shared thread pool.
&lt;/description&gt;
&lt;/property&gt;
&lt;property&gt;
&lt;name&gt;fs.s3a.threads.max&lt;/name&gt;
&lt;value&gt;10&lt;/value&gt;
&lt;description&gt;The total number of threads available in the filesystem for data
uploads *or any other queued filesystem operation*.&lt;/description&gt;
&lt;/property&gt;
&lt;property&gt;
&lt;name&gt;fs.s3a.max.total.tasks&lt;/name&gt;
&lt;value&gt;5&lt;/value&gt;
&lt;description&gt;The number of operations which can be queued for execution&lt;/description&gt;
&lt;/property&gt;
&lt;property&gt;
&lt;name&gt;fs.s3a.threads.keepalivetime&lt;/name&gt;
&lt;value&gt;60&lt;/value&gt;
&lt;description&gt;Number of seconds a thread can be idle before being
terminated.&lt;/description&gt;
&lt;/property&gt;
</pre></div></div>
</div>
<div class="section">
<h3><a name="Cleaning_up_after_partial_Upload_Failures"></a><a name="multipart_purge"></a>Cleaning up after partial Upload Failures</h3>
<p>There are two mechanisms for cleaning up after leftover multipart uploads:</p>
<ul>
<li>Hadoop s3guard CLI commands for listing and deleting uploads by their age, documented in the <a href="./s3guard.html">S3Guard</a> section (see the sketch below).</li>
<li>The configuration parameter <tt>fs.s3a.multipart.purge</tt>, covered below.</li>
</ul>
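<p>A sketch of the CLI usage, listing outstanding uploads under a bucket and then aborting those more than a day old; the bucket name is illustrative and the full set of options is documented in the <a href="./s3guard.html">S3Guard</a> section:</p>
<div>
<div>
<pre class="source">hadoop s3guard uploads -list s3a://example-bucket/
hadoop s3guard uploads -abort -days 1 -force s3a://example-bucket/
</pre></div></div>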
<p>If a large stream write operation is interrupted, there may be intermediate partitions uploaded to S3 &#x2014;data which will be billed for.</p>
<p>These charges can be reduced by enabling <tt>fs.s3a.multipart.purge</tt>, and setting a purge time in seconds, such as 86400 seconds &#x2014;24 hours. When an S3A FileSystem instance is instantiated with the purge time greater than zero, it will, on startup, delete all outstanding partition requests older than this time.</p>
<div>
<div>
<pre class="source">&lt;property&gt;
&lt;name&gt;fs.s3a.multipart.purge&lt;/name&gt;
&lt;value&gt;true&lt;/value&gt;
&lt;description&gt;True if you want to purge existing multipart uploads that may not have been
completed/aborted correctly&lt;/description&gt;
&lt;/property&gt;
&lt;property&gt;
&lt;name&gt;fs.s3a.multipart.purge.age&lt;/name&gt;
&lt;value&gt;86400&lt;/value&gt;
&lt;description&gt;Minimum age in seconds of multipart uploads to purge&lt;/description&gt;
&lt;/property&gt;
</pre></div></div>
<p>If an S3A client is instantiated with <tt>fs.s3a.multipart.purge=true</tt>, it will delete all out of date uploads <i>in the entire bucket</i>. That is: it will affect all multipart uploads to that bucket, from all applications.</p>
<p>Leaving <tt>fs.s3a.multipart.purge</tt> at its default, <tt>false</tt>, means that the client will not make any attempt to abort or clean up outstanding multipart uploads.</p>
<p>The best practice for using this option is to disable multipart purges in normal use of S3A, enabling it only in manual/scheduled housekeeping operations.</p></div>
<div class="section">
<h3><a name="S3A_.E2.80.9Cfadvise.E2.80.9D_input_policy_support"></a>S3A &#x201c;fadvise&#x201d; input policy support</h3>
<p>The S3A Filesystem client supports the notion of input policies, similar to that of the Posix <tt>fadvise()</tt> API call. This tunes the behavior of the S3A client to optimise HTTP GET requests for the different use cases.</p>
<p>See <a href="./performance.html#fadvise">Improving data input performance through fadvise</a> for the details.</p></div></div>
<div class="section">
<h2><a name="Metrics"></a><a name="metrics"></a>Metrics</h2>
<p>S3A metrics can be monitored through Hadoop&#x2019;s metrics2 framework. S3A creates its own metrics system called s3a-file-system, and each instance of the client will create its own metrics source, named with a JVM-unique numerical ID.</p>
<p>As a simple example, the following can be added to <tt>hadoop-metrics2.properties</tt> to write all S3A metrics to a log file every 10 seconds:</p>
<div>
<div>
<pre class="source">s3a-file-system.sink.my-metrics-config.class=org.apache.hadoop.metrics2.sink.FileSink
s3a-file-system.sink.my-metrics-config.filename=/var/log/hadoop-yarn/s3a-metrics.out
*.period=10
</pre></div></div>
<p>Lines in that file will be structured like the following:</p>
<div>
<div>
<pre class="source">1511208770680 s3aFileSystem.s3aFileSystem: Context=s3aFileSystem, s3aFileSystemId=892b02bb-7b30-4ffe-80ca-3a9935e1d96e, bucket=bucket,
Hostname=hostname-1.hadoop.apache.com, files_created=1, files_copied=2, files_copied_bytes=10000, files_deleted=5, fake_directories_deleted=3,
directories_created=3, directories_deleted=0, ignored_errors=0, op_copy_from_local_file=0, op_exists=0, op_get_file_status=15, op_glob_status=0,
op_is_directory=0, op_is_file=0, op_list_files=0, op_list_located_status=0, op_list_status=3, op_mkdirs=1, op_rename=2, object_copy_requests=0,
object_delete_requests=6, object_list_requests=23, object_continue_list_requests=0, object_metadata_requests=46, object_multipart_aborted=0,
object_put_bytes=0, object_put_requests=4, object_put_requests_completed=4, stream_write_failures=0, stream_write_block_uploads=0,
stream_write_block_uploads_committed=0, stream_write_block_uploads_aborted=0, stream_write_total_time=0, stream_write_total_data=0,
s3guard_metadatastore_put_path_request=10, s3guard_metadatastore_initialization=0, object_put_requests_active=0, object_put_bytes_pending=0,
stream_write_block_uploads_active=0, stream_write_block_uploads_pending=0, stream_write_block_uploads_data_pending=0,
S3guard_metadatastore_put_path_latencyNumOps=0, S3guard_metadatastore_put_path_latency50thPercentileLatency=0,
S3guard_metadatastore_put_path_latency75thPercentileLatency=0, S3guard_metadatastore_put_path_latency90thPercentileLatency=0,
S3guard_metadatastore_put_path_latency95thPercentileLatency=0, S3guard_metadatastore_put_path_latency99thPercentileLatency=0
</pre></div></div>
<p>Depending on other configuration, metrics from other systems, contexts, etc. may also get recorded, for example the following:</p>
<div>
<div>
<pre class="source">1511208770680 metricssystem.MetricsSystem: Context=metricssystem, Hostname=s3a-metrics-4.gce.cloudera.com, NumActiveSources=1, NumAllSources=1,
NumActiveSinks=1, NumAllSinks=0, Sink_fileNumOps=2, Sink_fileAvgTime=1.0, Sink_fileDropped=0, Sink_fileQsize=0, SnapshotNumOps=5,
SnapshotAvgTime=0.0, PublishNumOps=2, PublishAvgTime=0.0, DroppedPubAll=0
</pre></div></div>
<p>Note that low-level metrics from the AWS SDK itself are not currently included in these metrics.</p></div>
<div class="section">
<h2><a name="Other_Topics"></a><a name="further_reading"></a> Other Topics</h2>
<div class="section">
<h3><a name="Copying_Data_with_distcp"></a><a name="distcp"></a> Copying Data with distcp</h3>
<p>Hadoop&#x2019;s <tt>distcp</tt> tool is often used to copy data between a Hadoop cluster and Amazon S3. See <a class="externalLink" href="https://hortonworks.github.io/hdp-aws/s3-copy-data/index.html">Copying Data Between a Cluster and Amazon S3</a> for details on S3 copying specifically.</p>
<p>The <tt>distcp update</tt> command tries to do incremental updates of data. It is straightforward to detect that files do not match when they are of different lengths, but not when they are the same size.</p>
<p>Distcp addresses this by comparing file checksums on the source and destination filesystems, which it tries to do <i>even if the filesystems have incompatible checksum algorithms</i>.</p>
<p>The S3A connector can provide the HTTP etag header to the caller as the checksum of the uploaded file. Doing so will break distcp operations between hdfs and s3a.</p>
<p>For this reason, the etag-as-checksum feature is disabled by default.</p>
<div>
<div>
<pre class="source">&lt;property&gt;
&lt;name&gt;fs.s3a.etag.checksum.enabled&lt;/name&gt;
&lt;value&gt;false&lt;/value&gt;
&lt;description&gt;
Should calls to getFileChecksum() return the etag value of the remote
object.
WARNING: if enabled, distcp operations between HDFS and S3 will fail unless
-skipcrccheck is set.
&lt;/description&gt;
&lt;/property&gt;
</pre></div></div>
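<p>When the option is enabled, the checksum which S3A reports for an object can be inspected from the command line with <tt>hadoop fs -checksum</tt>; the bucket and path below are purely illustrative:</p>
<div>
<div>
<pre class="source">hadoop fs -checksum s3a://example-bucket/datasets/file1
</pre></div></div>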
<p>If enabled, <tt>distcp</tt> between two S3 buckets can use the checksum to compare objects. Their checksums should be identical if they were either each uploaded as a single-file PUT, or, if uploaded as a multipart PUT, uploaded in blocks of the same size, as configured by the value <tt>fs.s3a.multipart.size</tt>.</p>
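<p>As an illustration only (the bucket names here are hypothetical), such a bucket-to-bucket copy could enable etag checksums just for the duration of the command by passing the property as a generic option:</p>
<div>
<div>
<pre class="source">hadoop distcp -Dfs.s3a.etag.checksum.enabled=true -update \
  s3a://example-source/datasets s3a://example-backup/datasets
</pre></div></div>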
<p>To disable checksum verification in <tt>distcp</tt>, use the <tt>-skipcrccheck</tt> option:</p>
<div>
<div>
<pre class="source">hadoop distcp -update -skipcrccheck -numListstatusThreads 40 /user/alice/datasets s3a://alice-backup/datasets
</pre></div></div>
</div>
<div class="section">
<h3><a name="Advanced_-_Custom_Signers"></a><a name="customsigners"></a> Advanced - Custom Signers</h3>
<p>AWS uses request signing to authenticate requests. In general, there should be no need to override the signers; the defaults work out of the box. If, however, custom signers are required, this section describes how to configure them. There are two broad configuration categories: one for registering a custom signer and another for specifying which signer to use.</p>
<div class="section">
<h4><a name="Registering_Custom_Signers"></a>Registering Custom Signers</h4>
<div>
<div>
<pre class="source">&lt;property&gt;
&lt;name&gt;fs.s3a.custom.signers&lt;/name&gt;
&lt;value&gt;comma separated list of signers&lt;/value&gt;
&lt;!-- Example
&lt;value&gt;AWS4SignerType,CS1:CS1ClassName,CS2:CS2ClassName:CS2InitClass&lt;/value&gt;
--&gt;
&lt;/property&gt;
</pre></div></div>
<p>Acceptable values for each custom signer:</p>
<p><tt>SignerName</tt> - used when one of the default signers is being used (e.g. <tt>AWS4SignerType</tt>, <tt>QueryStringSignerType</tt>, <tt>AWSS3V4SignerType</tt>). If no custom signers are being used, this value does not need to be set.</p>
<p><tt>SignerName:SignerClassName</tt> - register a new signer with the specified name, and the class for this signer. The signer class must implement <tt>com.amazonaws.auth.Signer</tt>.</p>
<p><tt>SignerName:SignerClassName:SignerInitializerClassName</tt> - similar to the above, except it also allows a custom SignerInitializer (<tt>org.apache.hadoop.fs.s3a.AwsSignerInitializer</tt>) class to be specified.</p>
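<p>As a sketch only, a minimal custom signer might look like the following. The package, class and header names are hypothetical, and a real signer would normally compute an actual request signature rather than attach a fixed header:</p>
<div>
<div>
<pre class="source">package com.example.signers; // hypothetical package

import com.amazonaws.SignableRequest;
import com.amazonaws.auth.AWSCredentials;
import com.amazonaws.auth.Signer;

/**
 * Illustrative signer which attaches a fixed header to every request.
 */
public class CustomHeaderSigner implements Signer {
  @Override
  public void sign(SignableRequest&lt;?&gt; request, AWSCredentials credentials) {
    // Purely illustrative: a production signer would sign the request here.
    request.addHeader("x-example-auth", "example-token");
  }
}
</pre></div></div>
<p>Such a class could then be registered with, for example, <tt>fs.s3a.custom.signers=ExampleSigner:com.example.signers.CustomHeaderSigner</tt>.</p></div>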
<div class="section">
<h4><a name="Usage_of_the_Signers"></a>Usage of the Signers</h4>
<p>Signers can be set at a per-service level (S3, DynamoDB, etc.), or a common signer can be set for all services.</p>
<div>
<div>
<pre class="source">&lt;property&gt;
&lt;name&gt;fs.s3a.s3.signing-algorithm&lt;/name&gt;
&lt;value&gt;${S3SignerName}&lt;/value&gt;
&lt;description&gt;Specify the signer for S3&lt;/description&gt;
&lt;/property&gt;
&lt;property&gt;
&lt;name&gt;fs.s3a.ddb.signing-algorithm&lt;/name&gt;
&lt;value&gt;${DdbSignerName}&lt;/value&gt;
&lt;description&gt;Specify the signer for DDB&lt;/description&gt;
&lt;/property&gt;
&lt;property&gt;
&lt;name&gt;fs.s3a.signing-algorithm&lt;/name&gt;
&lt;value&gt;${SignerName}&lt;/value&gt;
&lt;/property&gt;
</pre></div></div>
<p>For a specific service, the service-specific signer is looked up first. If that is not specified, the common signer is looked up. If that is also not specified, the default SDK settings are used.</p>
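<p>For example (a sketch only, reusing the hypothetical signer registered above), the custom signer could be applied to S3 requests only:</p>
<div>
<div>
<pre class="source">&lt;property&gt;
  &lt;name&gt;fs.s3a.custom.signers&lt;/name&gt;
  &lt;value&gt;ExampleSigner:com.example.signers.CustomHeaderSigner&lt;/value&gt;
&lt;/property&gt;
&lt;property&gt;
  &lt;name&gt;fs.s3a.s3.signing-algorithm&lt;/name&gt;
  &lt;value&gt;ExampleSigner&lt;/value&gt;
&lt;/property&gt;
</pre></div></div>
<p>With this configuration, DynamoDB and other services would continue to use whichever signer the SDK selects by default.</p></div></div></div>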
</div>
</div>
<div class="clear">
<hr/>
</div>
<div id="footer">
<div class="xright">
&#169; 2008-2021
Apache Software Foundation
- <a href="http://maven.apache.org/privacy-policy.html">Privacy Policy</a>.
Apache Maven, Maven, Apache, the Apache feather logo, and the Apache Maven project logos are trademarks of The Apache Software Foundation.
</div>
<div class="clear">
<hr/>
</div>
</div>
</body>
</html>