<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<!--
| Generated by Apache Maven Doxia at 2021-06-15
| Rendered using Apache Maven Stylus Skin 1.5
-->
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<title>Apache Hadoop Amazon Web Services support &#x2013; Testing the S3A filesystem client and its features, including S3Guard</title>
<style type="text/css" media="all">
@import url("../../css/maven-base.css");
@import url("../../css/maven-theme.css");
@import url("../../css/site.css");
</style>
<link rel="stylesheet" href="../../css/print.css" type="text/css" media="print" />
<meta name="Date-Revision-yyyymmdd" content="20210615" />
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />
</head>
<body class="composite">
<div id="banner">
<a href="http://hadoop.apache.org/" id="bannerLeft">
<img src="http://hadoop.apache.org/images/hadoop-logo.jpg" alt="" />
</a>
<a href="http://www.apache.org/" id="bannerRight">
<img src="http://www.apache.org/images/asf_logo_wide.png" alt="" />
</a>
<div class="clear">
<hr/>
</div>
</div>
<div id="breadcrumbs">
<div class="xleft">
<a href="http://www.apache.org/" class="externalLink">Apache</a>
&gt;
<a href="http://hadoop.apache.org/" class="externalLink">Hadoop</a>
&gt;
<a href="../../index.html">Apache Hadoop Amazon Web Services support</a>
&gt;
Testing the S3A filesystem client and its features, including S3Guard
</div>
<div class="xright"> <a href="http://wiki.apache.org/hadoop" class="externalLink">Wiki</a>
|
<a href="https://gitbox.apache.org/repos/asf/hadoop.git" class="externalLink">git</a>
&nbsp;| Last Published: 2021-06-15
&nbsp;| Version: 3.3.1
</div>
<div class="clear">
<hr/>
</div>
</div>
<div id="leftColumn">
<div id="navcolumn">
<h5>General</h5>
<ul>
<li class="none">
<a href="../../../index.html">Overview</a>
</li>
<li class="none">
<a href="../../../hadoop-project-dist/hadoop-common/SingleCluster.html">Single Node Setup</a>
</li>
<li class="none">
<a href="../../../hadoop-project-dist/hadoop-common/ClusterSetup.html">Cluster Setup</a>
</li>
<li class="none">
<a href="../../../hadoop-project-dist/hadoop-common/CommandsManual.html">Commands Reference</a>
</li>
<li class="none">
<a href="../../../hadoop-project-dist/hadoop-common/FileSystemShell.html">FileSystem Shell</a>
</li>
<li class="none">
<a href="../../../hadoop-project-dist/hadoop-common/Compatibility.html">Compatibility Specification</a>
</li>
<li class="none">
<a href="../../../hadoop-project-dist/hadoop-common/DownstreamDev.html">Downstream Developer's Guide</a>
</li>
<li class="none">
<a href="../../../hadoop-project-dist/hadoop-common/AdminCompatibilityGuide.html">Admin Compatibility Guide</a>
</li>
<li class="none">
<a href="../../../hadoop-project-dist/hadoop-common/InterfaceClassification.html">Interface Classification</a>
</li>
<li class="none">
<a href="../../../hadoop-project-dist/hadoop-common/filesystem/index.html">FileSystem Specification</a>
</li>
</ul>
<h5>Common</h5>
<ul>
<li class="none">
<a href="../../../hadoop-project-dist/hadoop-common/CLIMiniCluster.html">CLI Mini Cluster</a>
</li>
<li class="none">
<a href="../../../hadoop-project-dist/hadoop-common/FairCallQueue.html">Fair Call Queue</a>
</li>
<li class="none">
<a href="../../../hadoop-project-dist/hadoop-common/NativeLibraries.html">Native Libraries</a>
</li>
<li class="none">
<a href="../../../hadoop-project-dist/hadoop-common/Superusers.html">Proxy User</a>
</li>
<li class="none">
<a href="../../../hadoop-project-dist/hadoop-common/RackAwareness.html">Rack Awareness</a>
</li>
<li class="none">
<a href="../../../hadoop-project-dist/hadoop-common/SecureMode.html">Secure Mode</a>
</li>
<li class="none">
<a href="../../../hadoop-project-dist/hadoop-common/ServiceLevelAuth.html">Service Level Authorization</a>
</li>
<li class="none">
<a href="../../../hadoop-project-dist/hadoop-common/HttpAuthentication.html">HTTP Authentication</a>
</li>
<li class="none">
<a href="../../../hadoop-project-dist/hadoop-common/CredentialProviderAPI.html">Credential Provider API</a>
</li>
<li class="none">
<a href="../../../hadoop-kms/index.html">Hadoop KMS</a>
</li>
<li class="none">
<a href="../../../hadoop-project-dist/hadoop-common/Tracing.html">Tracing</a>
</li>
<li class="none">
<a href="../../../hadoop-project-dist/hadoop-common/UnixShellGuide.html">Unix Shell Guide</a>
</li>
<li class="none">
<a href="../../../hadoop-project-dist/hadoop-common/registry/index.html">Registry</a>
</li>
</ul>
<h5>HDFS</h5>
<ul>
<li class="none">
<a href="../../../hadoop-project-dist/hadoop-hdfs/HdfsDesign.html">Architecture</a>
</li>
<li class="none">
<a href="../../../hadoop-project-dist/hadoop-hdfs/HdfsUserGuide.html">User Guide</a>
</li>
<li class="none">
<a href="../../../hadoop-project-dist/hadoop-hdfs/HDFSCommands.html">Commands Reference</a>
</li>
<li class="none">
<a href="../../../hadoop-project-dist/hadoop-hdfs/HDFSHighAvailabilityWithQJM.html">NameNode HA With QJM</a>
</li>
<li class="none">
<a href="../../../hadoop-project-dist/hadoop-hdfs/HDFSHighAvailabilityWithNFS.html">NameNode HA With NFS</a>
</li>
<li class="none">
<a href="../../../hadoop-project-dist/hadoop-hdfs/ObserverNameNode.html">Observer NameNode</a>
</li>
<li class="none">
<a href="../../../hadoop-project-dist/hadoop-hdfs/Federation.html">Federation</a>
</li>
<li class="none">
<a href="../../../hadoop-project-dist/hadoop-hdfs/ViewFs.html">ViewFs</a>
</li>
<li class="none">
<a href="../../../hadoop-project-dist/hadoop-hdfs/ViewFsOverloadScheme.html">ViewFsOverloadScheme</a>
</li>
<li class="none">
<a href="../../../hadoop-project-dist/hadoop-hdfs/HdfsSnapshots.html">Snapshots</a>
</li>
<li class="none">
<a href="../../../hadoop-project-dist/hadoop-hdfs/HdfsEditsViewer.html">Edits Viewer</a>
</li>
<li class="none">
<a href="../../../hadoop-project-dist/hadoop-hdfs/HdfsImageViewer.html">Image Viewer</a>
</li>
<li class="none">
<a href="../../../hadoop-project-dist/hadoop-hdfs/HdfsPermissionsGuide.html">Permissions and HDFS</a>
</li>
<li class="none">
<a href="../../../hadoop-project-dist/hadoop-hdfs/HdfsQuotaAdminGuide.html">Quotas and HDFS</a>
</li>
<li class="none">
<a href="../../../hadoop-project-dist/hadoop-hdfs/LibHdfs.html">libhdfs (C API)</a>
</li>
<li class="none">
<a href="../../../hadoop-project-dist/hadoop-hdfs/WebHDFS.html">WebHDFS (REST API)</a>
</li>
<li class="none">
<a href="../../../hadoop-hdfs-httpfs/index.html">HttpFS</a>
</li>
<li class="none">
<a href="../../../hadoop-project-dist/hadoop-hdfs/ShortCircuitLocalReads.html">Short Circuit Local Reads</a>
</li>
<li class="none">
<a href="../../../hadoop-project-dist/hadoop-hdfs/CentralizedCacheManagement.html">Centralized Cache Management</a>
</li>
<li class="none">
<a href="../../../hadoop-project-dist/hadoop-hdfs/HdfsNfsGateway.html">NFS Gateway</a>
</li>
<li class="none">
<a href="../../../hadoop-project-dist/hadoop-hdfs/HdfsRollingUpgrade.html">Rolling Upgrade</a>
</li>
<li class="none">
<a href="../../../hadoop-project-dist/hadoop-hdfs/ExtendedAttributes.html">Extended Attributes</a>
</li>
<li class="none">
<a href="../../../hadoop-project-dist/hadoop-hdfs/TransparentEncryption.html">Transparent Encryption</a>
</li>
<li class="none">
<a href="../../../hadoop-project-dist/hadoop-hdfs/HdfsMultihoming.html">Multihoming</a>
</li>
<li class="none">
<a href="../../../hadoop-project-dist/hadoop-hdfs/ArchivalStorage.html">Storage Policies</a>
</li>
<li class="none">
<a href="../../../hadoop-project-dist/hadoop-hdfs/MemoryStorage.html">Memory Storage Support</a>
</li>
<li class="none">
<a href="../../../hadoop-project-dist/hadoop-hdfs/SLGUserGuide.html">Synthetic Load Generator</a>
</li>
<li class="none">
<a href="../../../hadoop-project-dist/hadoop-hdfs/HDFSErasureCoding.html">Erasure Coding</a>
</li>
<li class="none">
<a href="../../../hadoop-project-dist/hadoop-hdfs/HDFSDiskbalancer.html">Disk Balancer</a>
</li>
<li class="none">
<a href="../../../hadoop-project-dist/hadoop-hdfs/HdfsUpgradeDomain.html">Upgrade Domain</a>
</li>
<li class="none">
<a href="../../../hadoop-project-dist/hadoop-hdfs/HdfsDataNodeAdminGuide.html">DataNode Admin</a>
</li>
<li class="none">
<a href="../../../hadoop-project-dist/hadoop-hdfs-rbf/HDFSRouterFederation.html">Router Federation</a>
</li>
<li class="none">
<a href="../../../hadoop-project-dist/hadoop-hdfs/HdfsProvidedStorage.html">Provided Storage</a>
</li>
</ul>
<h5>MapReduce</h5>
<ul>
<li class="none">
<a href="../../../hadoop-mapreduce-client/hadoop-mapreduce-client-core/MapReduceTutorial.html">Tutorial</a>
</li>
<li class="none">
<a href="../../../hadoop-mapreduce-client/hadoop-mapreduce-client-core/MapredCommands.html">Commands Reference</a>
</li>
<li class="none">
<a href="../../../hadoop-mapreduce-client/hadoop-mapreduce-client-core/MapReduce_Compatibility_Hadoop1_Hadoop2.html">Compatibility with 1.x</a>
</li>
<li class="none">
<a href="../../../hadoop-mapreduce-client/hadoop-mapreduce-client-core/EncryptedShuffle.html">Encrypted Shuffle</a>
</li>
<li class="none">
<a href="../../../hadoop-mapreduce-client/hadoop-mapreduce-client-core/PluggableShuffleAndPluggableSort.html">Pluggable Shuffle/Sort</a>
</li>
<li class="none">
<a href="../../../hadoop-mapreduce-client/hadoop-mapreduce-client-core/DistributedCacheDeploy.html">Distributed Cache Deploy</a>
</li>
<li class="none">
<a href="../../../hadoop-mapreduce-client/hadoop-mapreduce-client-core/SharedCacheSupport.html">Support for YARN Shared Cache</a>
</li>
</ul>
<h5>MapReduce REST APIs</h5>
<ul>
<li class="none">
<a href="../../../hadoop-mapreduce-client/hadoop-mapreduce-client-core/MapredAppMasterRest.html">MR Application Master</a>
</li>
<li class="none">
<a href="../../../hadoop-mapreduce-client/hadoop-mapreduce-client-hs/HistoryServerRest.html">MR History Server</a>
</li>
</ul>
<h5>YARN</h5>
<ul>
<li class="none">
<a href="../../../hadoop-yarn/hadoop-yarn-site/YARN.html">Architecture</a>
</li>
<li class="none">
<a href="../../../hadoop-yarn/hadoop-yarn-site/YarnCommands.html">Commands Reference</a>
</li>
<li class="none">
<a href="../../../hadoop-yarn/hadoop-yarn-site/CapacityScheduler.html">Capacity Scheduler</a>
</li>
<li class="none">
<a href="../../../hadoop-yarn/hadoop-yarn-site/FairScheduler.html">Fair Scheduler</a>
</li>
<li class="none">
<a href="../../../hadoop-yarn/hadoop-yarn-site/ResourceManagerRestart.html">ResourceManager Restart</a>
</li>
<li class="none">
<a href="../../../hadoop-yarn/hadoop-yarn-site/ResourceManagerHA.html">ResourceManager HA</a>
</li>
<li class="none">
<a href="../../../hadoop-yarn/hadoop-yarn-site/ResourceModel.html">Resource Model</a>
</li>
<li class="none">
<a href="../../../hadoop-yarn/hadoop-yarn-site/NodeLabel.html">Node Labels</a>
</li>
<li class="none">
<a href="../../../hadoop-yarn/hadoop-yarn-site/NodeAttributes.html">Node Attributes</a>
</li>
<li class="none">
<a href="../../../hadoop-yarn/hadoop-yarn-site/WebApplicationProxy.html">Web Application Proxy</a>
</li>
<li class="none">
<a href="../../../hadoop-yarn/hadoop-yarn-site/TimelineServer.html">Timeline Server</a>
</li>
<li class="none">
<a href="../../../hadoop-yarn/hadoop-yarn-site/TimelineServiceV2.html">Timeline Service V.2</a>
</li>
<li class="none">
<a href="../../../hadoop-yarn/hadoop-yarn-site/WritingYarnApplications.html">Writing YARN Applications</a>
</li>
<li class="none">
<a href="../../../hadoop-yarn/hadoop-yarn-site/YarnApplicationSecurity.html">YARN Application Security</a>
</li>
<li class="none">
<a href="../../../hadoop-yarn/hadoop-yarn-site/NodeManager.html">NodeManager</a>
</li>
<li class="none">
<a href="../../../hadoop-yarn/hadoop-yarn-site/DockerContainers.html">Running Applications in Docker Containers</a>
</li>
<li class="none">
<a href="../../../hadoop-yarn/hadoop-yarn-site/RuncContainers.html">Running Applications in runC Containers</a>
</li>
<li class="none">
<a href="../../../hadoop-yarn/hadoop-yarn-site/NodeManagerCgroups.html">Using CGroups</a>
</li>
<li class="none">
<a href="../../../hadoop-yarn/hadoop-yarn-site/SecureContainer.html">Secure Containers</a>
</li>
<li class="none">
<a href="../../../hadoop-yarn/hadoop-yarn-site/ReservationSystem.html">Reservation System</a>
</li>
<li class="none">
<a href="../../../hadoop-yarn/hadoop-yarn-site/GracefulDecommission.html">Graceful Decommission</a>
</li>
<li class="none">
<a href="../../../hadoop-yarn/hadoop-yarn-site/OpportunisticContainers.html">Opportunistic Containers</a>
</li>
<li class="none">
<a href="../../../hadoop-yarn/hadoop-yarn-site/Federation.html">YARN Federation</a>
</li>
<li class="none">
<a href="../../../hadoop-yarn/hadoop-yarn-site/SharedCache.html">Shared Cache</a>
</li>
<li class="none">
<a href="../../../hadoop-yarn/hadoop-yarn-site/UsingGpus.html">Using GPU</a>
</li>
<li class="none">
<a href="../../../hadoop-yarn/hadoop-yarn-site/UsingFPGA.html">Using FPGA</a>
</li>
<li class="none">
<a href="../../../hadoop-yarn/hadoop-yarn-site/PlacementConstraints.html">Placement Constraints</a>
</li>
<li class="none">
<a href="../../../hadoop-yarn/hadoop-yarn-site/YarnUI2.html">YARN UI2</a>
</li>
</ul>
<h5>YARN REST APIs</h5>
<ul>
<li class="none">
<a href="../../../hadoop-yarn/hadoop-yarn-site/WebServicesIntro.html">Introduction</a>
</li>
<li class="none">
<a href="../../../hadoop-yarn/hadoop-yarn-site/ResourceManagerRest.html">Resource Manager</a>
</li>
<li class="none">
<a href="../../../hadoop-yarn/hadoop-yarn-site/NodeManagerRest.html">Node Manager</a>
</li>
<li class="none">
<a href="../../../hadoop-yarn/hadoop-yarn-site/TimelineServer.html#Timeline_Server_REST_API_v1">Timeline Server</a>
</li>
<li class="none">
<a href="../../../hadoop-yarn/hadoop-yarn-site/TimelineServiceV2.html#Timeline_Service_v.2_REST_API">Timeline Service V.2</a>
</li>
</ul>
<h5>YARN Service</h5>
<ul>
<li class="none">
<a href="../../../hadoop-yarn/hadoop-yarn-site/yarn-service/Overview.html">Overview</a>
</li>
<li class="none">
<a href="../../../hadoop-yarn/hadoop-yarn-site/yarn-service/QuickStart.html">QuickStart</a>
</li>
<li class="none">
<a href="../../../hadoop-yarn/hadoop-yarn-site/yarn-service/Concepts.html">Concepts</a>
</li>
<li class="none">
<a href="../../../hadoop-yarn/hadoop-yarn-site/yarn-service/YarnServiceAPI.html">Yarn Service API</a>
</li>
<li class="none">
<a href="../../../hadoop-yarn/hadoop-yarn-site/yarn-service/ServiceDiscovery.html">Service Discovery</a>
</li>
<li class="none">
<a href="../../../hadoop-yarn/hadoop-yarn-site/yarn-service/SystemServices.html">System Services</a>
</li>
</ul>
<h5>Hadoop Compatible File Systems</h5>
<ul>
<li class="none">
<a href="../../../hadoop-aliyun/tools/hadoop-aliyun/index.html">Aliyun OSS</a>
</li>
<li class="none">
<a href="../../../hadoop-aws/tools/hadoop-aws/index.html">Amazon S3</a>
</li>
<li class="none">
<a href="../../../hadoop-azure/index.html">Azure Blob Storage</a>
</li>
<li class="none">
<a href="../../../hadoop-azure-datalake/index.html">Azure Data Lake Storage</a>
</li>
<li class="none">
<a href="../../../hadoop-openstack/index.html">OpenStack Swift</a>
</li>
<li class="none">
<a href="../../../hadoop-cos/cloud-storage/index.html">Tencent COS</a>
</li>
</ul>
<h5>Auth</h5>
<ul>
<li class="none">
<a href="../../../hadoop-auth/index.html">Overview</a>
</li>
<li class="none">
<a href="../../../hadoop-auth/Examples.html">Examples</a>
</li>
<li class="none">
<a href="../../../hadoop-auth/Configuration.html">Configuration</a>
</li>
<li class="none">
<a href="../../../hadoop-auth/BuildingIt.html">Building</a>
</li>
</ul>
<h5>Tools</h5>
<ul>
<li class="none">
<a href="../../../hadoop-streaming/HadoopStreaming.html">Hadoop Streaming</a>
</li>
<li class="none">
<a href="../../../hadoop-archives/HadoopArchives.html">Hadoop Archives</a>
</li>
<li class="none">
<a href="../../../hadoop-archive-logs/HadoopArchiveLogs.html">Hadoop Archive Logs</a>
</li>
<li class="none">
<a href="../../../hadoop-distcp/DistCp.html">DistCp</a>
</li>
<li class="none">
<a href="../../../hadoop-gridmix/GridMix.html">GridMix</a>
</li>
<li class="none">
<a href="../../../hadoop-rumen/Rumen.html">Rumen</a>
</li>
<li class="none">
<a href="../../../hadoop-resourceestimator/ResourceEstimator.html">Resource Estimator Service</a>
</li>
<li class="none">
<a href="../../../hadoop-sls/SchedulerLoadSimulator.html">Scheduler Load Simulator</a>
</li>
<li class="none">
<a href="../../../hadoop-project-dist/hadoop-common/Benchmarking.html">Hadoop Benchmarking</a>
</li>
<li class="none">
<a href="../../../hadoop-dynamometer/Dynamometer.html">Dynamometer</a>
</li>
</ul>
<h5>Reference</h5>
<ul>
<li class="none">
<a href="../../../hadoop-project-dist/hadoop-common/release/">Changelog and Release Notes</a>
</li>
<li class="none">
<a href="../../../api/index.html">Java API docs</a>
</li>
<li class="none">
<a href="../../../hadoop-project-dist/hadoop-common/UnixShellAPI.html">Unix Shell API</a>
</li>
<li class="none">
<a href="../../../hadoop-project-dist/hadoop-common/Metrics.html">Metrics</a>
</li>
</ul>
<h5>Configuration</h5>
<ul>
<li class="none">
<a href="../../../hadoop-project-dist/hadoop-common/core-default.xml">core-default.xml</a>
</li>
<li class="none">
<a href="../../../hadoop-project-dist/hadoop-hdfs/hdfs-default.xml">hdfs-default.xml</a>
</li>
<li class="none">
<a href="../../../hadoop-project-dist/hadoop-hdfs-rbf/hdfs-rbf-default.xml">hdfs-rbf-default.xml</a>
</li>
<li class="none">
<a href="../../../hadoop-mapreduce-client/hadoop-mapreduce-client-core/mapred-default.xml">mapred-default.xml</a>
</li>
<li class="none">
<a href="../../../hadoop-yarn/hadoop-yarn-common/yarn-default.xml">yarn-default.xml</a>
</li>
<li class="none">
<a href="../../../hadoop-kms/kms-default.html">kms-default.xml</a>
</li>
<li class="none">
<a href="../../../hadoop-hdfs-httpfs/httpfs-default.html">httpfs-default.xml</a>
</li>
<li class="none">
<a href="../../../hadoop-project-dist/hadoop-common/DeprecatedProperties.html">Deprecated Properties</a>
</li>
</ul>
<a href="http://maven.apache.org/" title="Built by Maven" class="poweredBy">
<img alt="Built by Maven" src="../../images/logos/maven-feather.png"/>
</a>
</div>
</div>
<div id="bodyColumn">
<div id="contentBox">
<!---
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License. See accompanying LICENSE file.
-->
<h1>Testing the S3A filesystem client and its features, including S3Guard</h1>
<ul>
<li><a href="#Policy_for_submitting_patches_which_affect_the_hadoop-aws_module."> Policy for submitting patches which affect the hadoop-aws module.</a>
<ul>
<li><a href="#The_submitter_of_any_patch_is_required_to_run_all_the_integration_tests_and_declare_which_S3_region.2Fimplementation_they_used.">The submitter of any patch is required to run all the integration tests and declare which S3 region/implementation they used.</a></li>
<li><a href="#What_if_there.E2.80.99s_an_intermittent_failure_of_a_test.3F">What if there&#x2019;s an intermittent failure of a test?</a></li>
<li><a href="#What_if_the_tests_are_timing_out_or_failing_over_my_network_connection.3F">What if the tests are timing out or failing over my network connection?</a></li></ul></li>
<li><a href="#Setting_up_the_tests"> Setting up the tests</a>
<ul>
<li><a href="#File_core-site.xml">File core-site.xml</a></li>
<li><a href="#File_auth-keys.xml">File auth-keys.xml</a></li>
<li><a href="#Configuring_S3a_Encryption"> Configuring S3a Encryption</a></li>
<li><a href="#Default_Encryption"> Default Encryption</a></li></ul></li>
<li><a href="#Running_the_Tests"> Running the Tests</a>
<ul>
<li><a href="#Testing_against_different_regions"> Testing against different regions</a></li>
<li><a href="#CSV_Data_Tests"> CSV Data Tests</a></li></ul></li>
<li><a href="#Viewing_Integration_Test_Reports"> Viewing Integration Test Reports</a></li>
<li><a href="#Testing_Versioned_Stores"> Testing Versioned Stores</a></li>
<li><a href="#Testing_Different_Marker_Retention_Policy"> Testing Different Marker Retention Policy</a>
<ul>
<li><a href="#Enabling_auditing_of_markers">Enabling auditing of markers</a></li></ul></li>
<li><a href="#Scale_Tests"> Scale Tests</a>
<ul>
<li><a href="#Enabling_the_Scale_Tests"> Enabling the Scale Tests</a></li>
<li><a href="#Tuning_scale_options_from_Maven"> Tuning scale options from Maven</a></li>
<li><a href="#Scale_test_configuration_options"> Scale test configuration options</a></li></ul></li>
<li><a href="#Load_tests."> Load tests.</a></li>
<li><a href="#Testing_against_non_AWS_S3_endpoints."> Testing against non AWS S3 endpoints.</a>
<ul>
<li><a href="#Disabling_the_encryption_tests">Disabling the encryption tests</a></li>
<li><a href="#Configuring_the_CSV_file_read_tests.2A.2A">Configuring the CSV file read tests</a></li>
<li><a href="#Turning_off_S3_Select">Turning off S3 Select</a></li>
<li><a href="#Testing_Session_Credentials">Testing Session Credentials</a></li></ul></li>
<li><a href="#Debugging_Test_failures"> Debugging Test failures</a></li>
<li><a href="#Adding_new_tests"> Adding new tests</a></li>
<li><a href="#Requirements_of_new_Tests"> Requirements of new Tests</a>
<ul>
<li><a href="#Subclasses_Existing_Shared_Base_Classes">Subclasses Existing Shared Base Classes</a></li>
<li><a href="#Secure">Secure</a></li>
<li><a href="#Efficient_of_Time_and_Money">Efficient of Time and Money</a></li>
<li><a href="#Works_With_Other_S3_Endpoints">Works With Other S3 Endpoints</a></li>
<li><a href="#Works_Over_Long-haul_Links">Works Over Long-haul Links</a></li>
<li><a href="#Provides_Diagnostics_and_timing_information">Provides Diagnostics and timing information</a></li>
<li><a href="#Fails_Meaningfully">Fails Meaningfully</a></li>
<li><a href="#Sets_up_its_filesystem_and_checks_for_those_settings">Sets up its filesystem and checks for those settings</a></li>
<li><a href="#Cleans_Up_Afterwards">Cleans Up Afterwards</a></li>
<li><a href="#Works_Reliably">Works Reliably</a></li>
<li><a href="#Runs_in_parallel_unless_this_is_unworkable.">Runs in parallel unless this is unworkable.</a></li></ul></li>
<li><a href="#Individual_test_cases_can_be_run_in_an_IDE">Individual test cases can be run in an IDE</a>
<ul>
<li><a href="#Keeping_AWS_Costs_down">Keeping AWS Costs down</a></li></ul></li>
<li><a href="#Tips"> Tips</a>
<ul>
<li><a href="#How_to_keep_your_credentials_really_safe">How to keep your credentials really safe</a></li></ul></li>
<li><a href="#Failure_Injection">Failure Injection</a></li>
<li><a href="#Simulating_List_Inconsistencies">Simulating List Inconsistencies</a>
<ul>
<li><a href="#Enabling_the_InconsistentAmazonS3CClient">Enabling the InconsistentAmazonS3CClient</a></li>
<li><a href="#Limitations_of_Inconsistency_Injection">Limitations of Inconsistency Injection</a></li>
<li><a href="#Using_the_InconsistentAmazonS3CClient_in_downstream_integration_tests">Using the InconsistentAmazonS3CClient in downstream integration tests</a></li></ul></li>
<li><a href="#Testing_S3Guard"> Testing S3Guard</a>
<ul>
<li><a href="#Testing_S3A_with_S3Guard_Enabled">Testing S3A with S3Guard Enabled</a></li>
<li><a href="#Notes">Notes</a></li>
<li><a href="#How_to_Dump_the_Table_and_Metastore_State">How to Dump the Table and Metastore State</a></li>
<li><a href="#Resetting_the_Metastore:_PurgeS3GuardDynamoTable">Resetting the Metastore: PurgeS3GuardDynamoTable</a></li>
<li><a href="#Scale_Testing_MetadataStore_Directly">Scale Testing MetadataStore Directly</a></li>
<li><a href="#Testing_encrypted_DynamoDB_tables">Testing encrypted DynamoDB tables</a></li>
<li><a href="#Testing_only:_Local_Metadata_Store">Testing only: Local Metadata Store</a></li></ul></li>
<li><a href="#Testing_Assumed_Roles"> Testing Assumed Roles</a></li>
<li><a href="#Qualifying_an_AWS_SDK_Update"> Qualifying an AWS SDK Update</a>
<ul>
<li><a href="#Basic_command_line_regression_testing">Basic command line regression testing</a></li>
<li><a href="#Other_tests">Other tests</a></li>
<li><a href="#Dealing_with_Deprecated_APIs_and_New_Features">Dealing with Deprecated APIs and New Features</a></li>
<li><a href="#Committing_the_patch">Committing the patch</a></li></ul></li></ul>
<p>This module includes both unit tests, which can run in isolation without connecting to the S3 service, and integration tests, which require a working connection to S3 to interact with a bucket. Unit test suites follow the naming convention <tt>Test*.java</tt>. Integration tests follow the naming convention <tt>ITest*.java</tt>.</p>
<p>Due to eventual consistency, integration tests may fail without reason. Transient failures, which no longer occur upon rerunning the test, should thus be ignored.</p>
<div class="section">
<h2><a name="Policy_for_submitting_patches_which_affect_the_hadoop-aws_module."></a><a name="policy"></a> Policy for submitting patches which affect the <tt>hadoop-aws</tt> module.</h2>
<p>The Apache Jenkins infrastructure does not run any S3 integration tests, due to the need to keep credentials secure.</p>
<div class="section">
<h3><a name="The_submitter_of_any_patch_is_required_to_run_all_the_integration_tests_and_declare_which_S3_region.2Fimplementation_they_used."></a>The submitter of any patch is required to run all the integration tests and declare which S3 region/implementation they used.</h3>
<p>This is important: <b>patches which do not include this declaration will be ignored</b></p>
<p>This policy has proven to be the only mechanism to guarantee full regression testing of code changes. Why the declaration of region? Two reasons</p>
<ol style="list-style-type: decimal">
<li>It helps us identify regressions which only surface against specific endpoints or third-party implementations of the S3 protocol.</li>
<li>It forces the submitters to be more honest about their testing. It&#x2019;s easy to lie, &#x201c;yes, I tested this&#x201d;. To say &#x201c;yes, I tested this against S3 US-west&#x201d; is a more specific lie and harder to make. And, if you get caught out: you lose all credibility with the project.</li>
</ol>
<p>You don&#x2019;t need to test from a VM within the AWS infrastructure; with the <tt>-Dparallel-tests</tt> option the non-scale tests complete in under ten minutes. Because the tests clean up after themselves, they are also designed to be low cost. It&#x2019;s neither hard nor expensive to run the tests; if you can&#x2019;t, there&#x2019;s no guarantee your patch works. The reviewers have enough to do, and don&#x2019;t have the time to do these tests, especially as every failure simply makes for slower iterative development.</p>
<p>Please: run the tests. And if you don&#x2019;t, we are sorry for declining your patch, but we have to.</p></div>
<div class="section">
<h3><a name="What_if_there.E2.80.99s_an_intermittent_failure_of_a_test.3F"></a>What if there&#x2019;s an intermittent failure of a test?</h3>
<p>Some of the tests do fail intermittently, especially in parallel runs. If this happens, try to run the test on its own to see if the test succeeds.</p>
<p>If it still fails, include this fact in your declaration. We know some tests are intermittently unreliable.</p></div>
<div class="section">
<h3><a name="What_if_the_tests_are_timing_out_or_failing_over_my_network_connection.3F"></a>What if the tests are timing out or failing over my network connection?</h3>
<p>The tests and the S3A client are designed to be configurable for different timeouts. If you are seeing problems and this configuration isn&#x2019;t working, that&#x2019;s a sign that the configuration mechanism isn&#x2019;t complete. If it&#x2019;s happening in the production code, that could be a sign of a problem which may surface over long-haul connections. Please help us identify and fix these problems &#x2014; especially as you are the one best placed to verify the fixes work.</p></div></div>
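<p>As a starting point for tuning over a slow or long-haul connection, the standard S3A client timeout and retry options can be raised in <tt>core-site.xml</tt> or <tt>auth-keys.xml</tt>. The property names below are the normal S3A connection options; the values are purely illustrative, not recommendations.</p>
<div>
<div>
<pre class="source">&lt;!-- illustrative values only: tune for your own network --&gt;
&lt;property&gt;
&lt;name&gt;fs.s3a.connection.timeout&lt;/name&gt;
&lt;value&gt;500000&lt;/value&gt;
&lt;/property&gt;
&lt;property&gt;
&lt;name&gt;fs.s3a.connection.establish.timeout&lt;/name&gt;
&lt;value&gt;20000&lt;/value&gt;
&lt;/property&gt;
&lt;property&gt;
&lt;name&gt;fs.s3a.attempts.maximum&lt;/name&gt;
&lt;value&gt;30&lt;/value&gt;
&lt;/property&gt;
</pre></div></div>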
<div class="section">
<h2><a name="Setting_up_the_tests"></a><a name="setting-up"></a> Setting up the tests</h2>
<p>To integration test the S3* filesystem clients, you need to provide <tt>auth-keys.xml</tt> which passes in authentication details to the test runner.</p>
<p>It is a Hadoop XML configuration file, which must be placed into <tt>hadoop-tools/hadoop-aws/src/test/resources</tt>.</p>
<div class="section">
<h3><a name="File_core-site.xml"></a>File <tt>core-site.xml</tt></h3>
<p>This file pre-exists and sources the configurations created under <tt>auth-keys.xml</tt>.</p>
<p>For most purposes you will not need to edit this file unless you need to apply a specific, non-default property change during the tests.</p></div>
<div class="section">
<h3><a name="File_auth-keys.xml"></a>File <tt>auth-keys.xml</tt></h3>
<p>The presence of this file triggers the testing of the S3 classes.</p>
<p>Without this file, <i>none of the integration tests in this module will be executed</i>.</p>
<p>The XML file must contain all the ID/key information needed to connect each of the filesystem clients to the object stores, and a URL for each filesystem for its testing.</p>
<ol style="list-style-type: decimal">
<li><tt>test.fs.s3a.name</tt> : the URL of the bucket for S3a tests</li>
<li><tt>fs.contract.test.fs.s3a</tt> : the URL of the bucket for S3a filesystem contract tests</li>
</ol>
<p>The contents of the bucket will be destroyed during the test process: do not use the bucket for any purpose other than testing. Furthermore, for s3a, all in-progress multi-part uploads to the bucket will be aborted at the start of a test (by forcing <tt>fs.s3a.multipart.purge=true</tt>) to clean up the temporary state of previously failed tests.</p>
<p>Example:</p>
<div>
<div>
<pre class="source">&lt;configuration&gt;
&lt;property&gt;
&lt;name&gt;test.fs.s3a.name&lt;/name&gt;
&lt;value&gt;s3a://test-aws-s3a/&lt;/value&gt;
&lt;/property&gt;
&lt;property&gt;
&lt;name&gt;fs.contract.test.fs.s3a&lt;/name&gt;
&lt;value&gt;${test.fs.s3a.name}&lt;/value&gt;
&lt;/property&gt;
&lt;property&gt;
&lt;name&gt;fs.s3a.access.key&lt;/name&gt;
&lt;description&gt;AWS access key ID. Omit for IAM role-based authentication.&lt;/description&gt;
&lt;value&gt;DONOTCOMMITTHISKEYTOSCM&lt;/value&gt;
&lt;/property&gt;
&lt;property&gt;
&lt;name&gt;fs.s3a.secret.key&lt;/name&gt;
&lt;description&gt;AWS secret key. Omit for IAM role-based authentication.&lt;/description&gt;
&lt;value&gt;DONOTEVERSHARETHISSECRETKEY!&lt;/value&gt;
&lt;/property&gt;
&lt;property&gt;
&lt;name&gt;test.sts.endpoint&lt;/name&gt;
&lt;description&gt;Specific endpoint to use for STS requests.&lt;/description&gt;
&lt;value&gt;sts.amazonaws.com&lt;/value&gt;
&lt;/property&gt;
&lt;/configuration&gt;
</pre></div></div>
</div>
<div class="section">
<h3><a name="Configuring_S3a_Encryption"></a><a name="encryption"></a> Configuring S3a Encryption</h3>
<p>For S3a encryption tests to run correctly, the <tt>fs.s3a.server-side-encryption.key</tt> must be configured in the s3a contract xml file or the <tt>auth-keys.xml</tt> file with an AWS KMS encryption key ARN, as this value is different for each AWS KMS key. Please note this KMS key should be created in the same region as your S3 bucket. Otherwise, you may get a <tt>KMS.NotFoundException</tt>.</p>
<p>Example:</p>
<div>
<div>
<pre class="source">&lt;property&gt;
&lt;name&gt;fs.s3a.server-side-encryption.key&lt;/name&gt;
&lt;value&gt;arn:aws:kms:us-west-2:360379543683:key/071a86ff-8881-4ba0-9230-95af6d01ca01&lt;/value&gt;
&lt;/property&gt;
</pre></div></div>
<p>You can also force all the tests to run with a specific SSE encryption method by configuring the property <tt>fs.s3a.server-side-encryption-algorithm</tt> in the s3a contract file.</p></div>
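<p>For example, a minimal illustrative setting to force all tests to use SSE-KMS (which also needs the KMS key ARN configured above) would be:</p>
<div>
<div>
<pre class="source">&lt;property&gt;
&lt;name&gt;fs.s3a.server-side-encryption-algorithm&lt;/name&gt;
&lt;value&gt;SSE-KMS&lt;/value&gt;
&lt;/property&gt;
</pre></div></div>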
<div class="section">
<h3><a name="Default_Encryption"></a><a name="default_encyption"></a> Default Encryption</h3>
<p>Buckets can be configured with <a class="externalLink" href="https://docs.aws.amazon.com/AmazonS3/latest/dev/bucket-encryption.html">default encryption</a> on the AWS side. Some S3AFileSystem tests are skipped when default encryption is enabled due to unpredictability in how <a class="externalLink" href="https://docs.aws.amazon.com/AmazonS3/latest/API/RESTCommonResponseHeaders.html">ETags</a> are generated.</p></div></div>
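<p>If you are unsure whether your test bucket has default encryption enabled, one way to check is with the AWS CLI (assuming it is installed and configured for the account owning the bucket; the bucket name is a placeholder):</p>
<div>
<div>
<pre class="source"># returns the default-encryption rules, or an error if none are set
aws s3api get-bucket-encryption --bucket your-test-bucket
</pre></div></div>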
<div class="section">
<h2><a name="Running_the_Tests"></a><a name="running"></a> Running the Tests</h2>
<p>After completing the configuration, execute the test run through Maven.</p>
<div>
<div>
<pre class="source">mvn clean verify
</pre></div></div>
<p>It&#x2019;s also possible to execute multiple test suites in parallel by passing the <tt>parallel-tests</tt> property on the command line. The tests spend most of their time blocked on network I/O with the S3 service, so running in parallel tends to complete full test runs faster.</p>
<div>
<div>
<pre class="source">mvn -Dparallel-tests clean verify
</pre></div></div>
<p>Some tests must run with exclusive access to the S3 bucket, so even with the <tt>parallel-tests</tt> property, several test suites will run in serial in a separate Maven execution step after the parallel tests.</p>
<p>By default, <tt>parallel-tests</tt> runs 4 test suites concurrently. This can be tuned by passing the <tt>testsThreadCount</tt> property.</p>
<div>
<div>
<pre class="source">mvn -Dparallel-tests -DtestsThreadCount=8 clean verify
</pre></div></div>
<p>To run just unit tests, which do not require S3 connectivity or AWS credentials, use any of the above invocations, but switch the goal to <tt>test</tt> instead of <tt>verify</tt>.</p>
<div>
<div>
<pre class="source">mvn clean test
mvn -Dparallel-tests clean test
mvn -Dparallel-tests -DtestsThreadCount=8 clean test
</pre></div></div>
<p>To run only a specific named subset of tests, pass the <tt>test</tt> property for unit tests or the <tt>it.test</tt> property for integration tests.</p>
<div>
<div>
<pre class="source">mvn clean test -Dtest=TestS3AInputPolicies
mvn clean verify -Dit.test=ITestS3AFileContextStatistics -Dtest=none
mvn clean verify -Dtest=TestS3A* -Dit.test=ITestS3A*
</pre></div></div>
<p>Note that when running a specific subset of tests, the patterns passed in <tt>test</tt> and <tt>it.test</tt> override the configuration of which tests need to run in isolation in a separate serial phase (mentioned above). This can cause unpredictable results, so the recommendation is to avoid passing <tt>parallel-tests</tt> in combination with <tt>test</tt> or <tt>it.test</tt>. If you know that you are specifying only tests that can run safely in parallel, then it will work. For wide patterns, like <tt>ITestS3A*</tt> shown above, it may cause unpredictable test failures.</p>
<div class="section">
<h3><a name="Testing_against_different_regions"></a><a name="regions"></a> Testing against different regions</h3>
<p>S3A can connect to different regions &#x2014;the tests support this. Simply define the target region in <tt>auth-keys.xml</tt>.</p>
<div>
<div>
<pre class="source">&lt;property&gt;
&lt;name&gt;fs.s3a.endpoint&lt;/name&gt;
&lt;value&gt;s3.eu-central-1.amazonaws.com&lt;/value&gt;
&lt;/property&gt;
</pre></div></div>
<p>Alternatively you can use endpoints defined in <a href="../../../../test/resources/core-site.xml">core-site.xml</a>.</p>
<div>
<div>
<pre class="source">&lt;property&gt;
&lt;name&gt;fs.s3a.endpoint&lt;/name&gt;
&lt;value&gt;${frankfurt.endpoint}&lt;/value&gt;
&lt;/property&gt;
</pre></div></div>
<p>This is used for all tests except for the scale tests using a public CSV .gz file (see below).</p></div>
<div class="section">
<h3><a name="CSV_Data_Tests"></a><a name="csv"></a> CSV Data Tests</h3>
<p>The <tt>ITestS3AInputStreamPerformance</tt> tests require read access to a multi-MB text file. The default file for these tests is one published by Amazon, <a class="externalLink" href="http://landsat-pds.s3.amazonaws.com/scene_list.gz">s3a://landsat-pds/scene_list.gz</a>. This is a gzipped CSV index of other files which Amazon serves for open use.</p>
<p>The path to this object is set in the option <tt>fs.s3a.scale.test.csvfile</tt>,</p>
<div>
<div>
<pre class="source">&lt;property&gt;
&lt;name&gt;fs.s3a.scale.test.csvfile&lt;/name&gt;
&lt;value&gt;s3a://landsat-pds/scene_list.gz&lt;/value&gt;
&lt;/property&gt;
</pre></div></div>
<ol style="list-style-type: decimal">
<li>If the option is not overridden, the default value is used. This is hosted in Amazon&#x2019;s US-east datacenter.</li>
<li>If <tt>fs.s3a.scale.test.csvfile</tt> is empty, tests which require it will be skipped.</li>
<li>If the data cannot be read for any reason then the test will fail.</li>
<li>If the property is set to a different path, then that data must be readable and &#x201c;sufficiently&#x201d; large.</li>
</ol>
<p>(To make the option empty, set its value to a space or newline: these add &#x201c;an empty entry&#x201d;, whereas an empty <tt>&lt;value/&gt;</tt> element would be considered undefined and pick up the default.)</p>
<p>If using a test file in an S3 region requiring a different endpoint value from the one set in <tt>fs.s3a.endpoint</tt>, a bucket-specific endpoint must be defined. For the default test dataset, hosted in the <tt>landsat-pds</tt> bucket, this is:</p>
<div>
<div>
<pre class="source">&lt;property&gt;
&lt;name&gt;fs.s3a.bucket.landsat-pds.endpoint&lt;/name&gt;
&lt;value&gt;s3.amazonaws.com&lt;/value&gt;
&lt;description&gt;The endpoint for s3a://landsat-pds URLs&lt;/description&gt;
&lt;/property&gt;
</pre></div></div>
</div></div>
<div class="section">
<h2><a name="Viewing_Integration_Test_Reports"></a><a name="reporting"></a> Viewing Integration Test Reports</h2>
<p>Integration test results and logs are stored in <tt>target/failsafe-reports/</tt>. An HTML report can be generated during site generation, or with the <tt>surefire-report</tt> plugin:</p>
<div>
<div>
<pre class="source">mvn surefire-report:failsafe-report-only
</pre></div></div>
</div>
<div class="section">
<h2><a name="Testing_Versioned_Stores"></a><a name="versioning"></a> Testing Versioned Stores</h2>
<p>Some tests (specifically some in <tt>ITestS3ARemoteFileChanged</tt>) require a versioned bucket for full test coverage as well as S3Guard being enabled.</p>
<p>To enable versioning in a bucket.</p>
<ol style="list-style-type: decimal">
<li>In the AWS S3 Management console find and select the bucket.</li>
<li>In the Properties &#x201c;tab&#x201d;, set it as versioned (or enable versioning from the command line, as sketched below).</li>
<li><i>Important</i> Create a lifecycle rule to automatically clean up old versions after 24h. This avoids running up bills for objects which test runs create and then delete.</li>
<li>Run the tests again.</li>
</ol>
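<p>A sketch of performing steps 2 and 3 from the command line instead of the console, assuming the AWS CLI is configured and using a placeholder bucket name:</p>
<div>
<div>
<pre class="source"># enable versioning on the (hypothetical) test bucket
aws s3api put-bucket-versioning --bucket your-test-bucket \
  --versioning-configuration Status=Enabled

# confirm the change
aws s3api get-bucket-versioning --bucket your-test-bucket

# a noncurrent-version expiration rule can then be added with
# "aws s3api put-bucket-lifecycle-configuration", or in the console
</pre></div></div>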
<p>Once a bucket is converted to being versioned, it cannot be converted back to being unversioned.</p></div>
<div class="section">
<h2><a name="Testing_Different_Marker_Retention_Policy"></a><a name="marker"></a> Testing Different Marker Retention Policy</h2>
<p>Hadoop supports <a href="directory_markers.html">different policies for directory marker retention</a> -essentially the classic &#x201c;delete&#x201d; and the higher-performance &#x201c;keep&#x201d; options; &#x201c;authoritative&#x201d; is just &#x201c;keep&#x201d; restricted to a part of the bucket.</p>
<p>Example: test with <tt>markers=delete</tt></p>
<div>
<div>
<pre class="source">mvn verify -Dparallel-tests -DtestsThreadCount=4 -Dmarkers=delete
</pre></div></div>
<p>Example: test with <tt>markers=keep</tt></p>
<div>
<div>
<pre class="source">mvn verify -Dparallel-tests -DtestsThreadCount=4 -Dmarkers=keep
</pre></div></div>
<p>Example: test with <tt>markers=authoritative</tt></p>
<div>
<div>
<pre class="source">mvn verify -Dparallel-tests -DtestsThreadCount=4 -Dmarkers=authoritative
</pre></div></div>
<p>This final option is of limited use unless paths in the bucket have actually been configured to be of mixed status; if nothing is set up then the outcome should equal that of &#x201c;delete&#x201d;.</p>
<div class="section">
<h3><a name="Enabling_auditing_of_markers"></a>Enabling auditing of markers</h3>
<p>To enable an audit of the output directory of every test suite, enable the option <tt>fs.s3a.directory.marker.audit</tt></p>
<div>
<div>
<pre class="source">-Dfs.s3a.directory.marker.audit=true
</pre></div></div>
<p>When set, if the marker policy is to delete markers under the test output directory, then the marker tool audit command will be run. This will fail if a marker was found.</p>
<p>This adds extra overhead to every operation, but helps verify that the connector is not keeping markers where it needs to be deleting them -and hence backwards compatibility is maintained.</p></div></div>
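<p>The same audit can also be run by hand with the marker tool against a path of your choosing; a sketch, using a placeholder bucket whose test tree is expected to be marker-free:</p>
<div>
<div>
<pre class="source"># audit a path for surplus directory markers (hypothetical bucket)
hadoop s3guard markers -audit s3a://your-test-bucket/test
</pre></div></div>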
<div class="section">
<h2><a name="Scale_Tests"></a><a name="scale"></a> Scale Tests</h2>
<p>There are a set of tests designed to measure the scalability and performance at scale of the S3A client, the <i>Scale Tests</i>. Tests include: creating and traversing directory trees, uploading large files, renaming them, deleting them, seeking through the files, performing random IO, and others. This makes them a foundational part of the benchmarking.</p>
<p>By their very nature they are slow. And, as their execution time is often limited by bandwidth between the computer running the tests and the S3 endpoint, parallel execution does not speed these tests up.</p>
<p><b><i>Note: Running scale tests with <tt>-Ds3guard</tt> and <tt>-Ddynamo</tt> requires that you use a private, testing-only DynamoDB table.</i></b> The tests do disruptive things such as deleting metadata and setting the provisioned throughput to very low values.</p>
<div class="section">
<h3><a name="Enabling_the_Scale_Tests"></a><a name="enabling-scale"></a> Enabling the Scale Tests</h3>
<p>The tests are enabled if the <tt>scale</tt> property is set in the maven build; this can be done regardless of whether or not the parallel test profile is used.</p>
<div>
<div>
<pre class="source">mvn verify -Dscale
mvn verify -Dparallel-tests -Dscale -DtestsThreadCount=8
</pre></div></div>
<p>The most bandwidth intensive tests (those which upload data) always run sequentially; those which are slow due to HTTPS setup costs or server-side actions are included in the set of parallelized tests.</p></div>
<div class="section">
<h3><a name="Tuning_scale_options_from_Maven"></a><a name="tuning_scale"></a> Tuning scale options from Maven</h3>
<p>Some of the tests can be tuned from the maven build or from the configuration file used to run the tests.</p>
<div>
<div>
<pre class="source">mvn verify -Dparallel-tests -Dscale -DtestsThreadCount=8 -Dfs.s3a.scale.test.huge.filesize=128M
</pre></div></div>
<p>The algorithm is</p>
<ol style="list-style-type: decimal">
<li>The value is queried from the configuration file, using a default value if it is not set.</li>
<li>The value is queried from the JVM System Properties, where it is passed down by maven.</li>
<li>If the system property is null, an empty string, or it has the value <tt>unset</tt>, then the configuration value is used. The <tt>unset</tt> option is used to <a class="externalLink" href="http://stackoverflow.com/questions/7773134/null-versus-empty-arguments-in-maven">work round a quirk in maven property propagation</a>.</li>
</ol>
<p>Only a few properties can be set this way; more will be added.</p>
<table border="0" class="bodyTable">
<thead>
<tr class="a">
<th> Property </th>
<th> Meaning </th></tr>
</thead><tbody>
<tr class="b">
<td> <tt>fs.s3a.scale.test.timeout</tt></td>
<td> Timeout in seconds for scale tests </td></tr>
<tr class="a">
<td> <tt>fs.s3a.scale.test.huge.filesize</tt></td>
<td> Size for huge file uploads </td></tr>
<tr class="b">
<td> <tt>fs.s3a.scale.test.huge.partitionsize</tt></td>
<td> Size for partitions in huge file uploads </td></tr>
</tbody>
</table>
<p>The file and partition sizes are numeric values with a k/m/g/t/p suffix depending on the desired size. For example: 128M, 128m, 2G, 2g, 4T or even 1P.</p></div>
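<p>As an illustration (the values are arbitrary), several of these can be combined on one command line; the timeout is in seconds, the sizes use the suffixes above:</p>
<div>
<div>
<pre class="source">mvn verify -Dparallel-tests -DtestsThreadCount=8 -Dscale \
  -Dfs.s3a.scale.test.huge.filesize=256M \
  -Dfs.s3a.scale.test.timeout=1200
</pre></div></div>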
<div class="section">
<h3><a name="Scale_test_configuration_options"></a><a name="scale-config"></a> Scale test configuration options</h3>
<p>Some scale tests perform multiple operations (such as creating many directories).</p>
<p>The exact number of operations to perform is configurable in the option <tt>scale.test.operation.count</tt></p>
<div>
<div>
<pre class="source">&lt;property&gt;
&lt;name&gt;scale.test.operation.count&lt;/name&gt;
&lt;value&gt;10&lt;/value&gt;
&lt;/property&gt;
</pre></div></div>
<p>Larger values generate more load, and are recommended when testing locally, or in batch runs.</p>
<p>Smaller values result in faster test runs, especially when the object store is a long way away.</p>
<p>Operations which work on directories have a separate option: this controls the width and depth of tests creating recursive directories. Larger values create exponentially more directories, with consequent performance impact.</p>
<div>
<div>
<pre class="source">&lt;property&gt;
&lt;name&gt;scale.test.directory.count&lt;/name&gt;
&lt;value&gt;2&lt;/value&gt;
&lt;/property&gt;
</pre></div></div>
<p>DistCp tests targeting S3A support a configurable file size. The default is 10 MB, but the configuration value is expressed in KB so that it can be tuned smaller to achieve faster test runs.</p>
<div>
<div>
<pre class="source">&lt;property&gt;
&lt;name&gt;scale.test.distcp.file.size.kb&lt;/name&gt;
&lt;value&gt;10240&lt;/value&gt;
&lt;/property&gt;
</pre></div></div>
<p>S3A specific scale test properties are</p>
<p><i><tt>fs.s3a.scale.test.huge.filesize</tt>: size in MB for &#x201c;Huge file tests&#x201d;.</i></p>
<p>The Huge File tests validate S3A&#x2019;s ability to handle large files &#x2014;the property <tt>fs.s3a.scale.test.huge.filesize</tt> declares the file size to use.</p>
<div>
<div>
<pre class="source">&lt;property&gt;
&lt;name&gt;fs.s3a.scale.test.huge.filesize&lt;/name&gt;
&lt;value&gt;200M&lt;/value&gt;
&lt;/property&gt;
</pre></div></div>
<p>Amazon S3 handles files larger than 5GB differently than smaller ones. Setting the huge filesize to a number greater than that validates support for huge files.</p>
<div>
<div>
<pre class="source">&lt;property&gt;
&lt;name&gt;fs.s3a.scale.test.huge.filesize&lt;/name&gt;
&lt;value&gt;6G&lt;/value&gt;
&lt;/property&gt;
</pre></div></div>
<p>Tests at this scale are slow: they are best executed from hosts running in the cloud infrastructure where the S3 endpoint is based. Otherwise, set a large timeout in <tt>fs.s3a.scale.test.timeout</tt></p>
<div>
<div>
<pre class="source">&lt;property&gt;
&lt;name&gt;fs.s3a.scale.test.timeout&lt;/name&gt;
&lt;value&gt;432000&lt;/value&gt;
&lt;/property&gt;
</pre></div></div>
<p>The tests are executed in an order such that created files are only cleaned up after the end of all the tests. If the tests are interrupted, the test data will remain.</p></div></div>
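<p>Leftover data from an interrupted run can be removed by hand. This is only a sketch, with a placeholder bucket and path; check what actually remains before deleting anything:</p>
<div>
<div>
<pre class="source"># list what the interrupted run left behind (hypothetical bucket)
hadoop fs -ls -R s3a://your-test-bucket/

# then delete the unwanted test directories (hypothetical path)
hadoop fs -rm -r s3a://your-test-bucket/path-left-by-tests
</pre></div></div>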
<div class="section">
<h2><a name="Load_tests."></a><a name="load"></a> Load tests.</h2>
<p>Some tests are designed to overload AWS services with more requests per second than an AWS account is permitted.</p>
<p>The operation of these tests may be observable to other users of the same account -especially if they are working in the AWS region to which the tests are targeted.</p>
<p>They may also run up larger bills.</p>
<p>These tests all have the prefix <tt>ILoadTest</tt></p>
<p>They do not run automatically: they must be explicitly run from the command line or an IDE.</p>
<p>Look in the source for these and read the Javadocs before executing them.</p></div>
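<p>A sketch of selecting them explicitly by their name prefix; check the Javadocs of the individual suites for any further requirements before running:</p>
<div>
<div>
<pre class="source">mvn verify -Dtest=none -Dit.test=ILoadTest*
</pre></div></div>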
<div class="section">
<h2><a name="Testing_against_non_AWS_S3_endpoints."></a><a name="alternate_s3"></a> Testing against non AWS S3 endpoints.</h2>
<p>The S3A filesystem is designed to work with storage endpoints which implement the S3 protocols to the extent that the Amazon S3 SDK is capable of talking to them. We encourage testing against other filesystems and submissions of patches which address issues. In particular, we encourage testing of Hadoop release candidates, as these third-party endpoints get even less testing than the S3 endpoint itself.</p>
<div class="section">
<h3><a name="Disabling_the_encryption_tests"></a>Disabling the encryption tests</h3>
<p>If the endpoint doesn&#x2019;t support server-side encryption, these tests will fail. They can be turned off.</p>
<div>
<div>
<pre class="source">&lt;property&gt;
&lt;name&gt;test.fs.s3a.encryption.enabled&lt;/name&gt;
&lt;value&gt;false&lt;/value&gt;
&lt;/property&gt;
</pre></div></div>
<p>Encryption is only used for those specific test suites with <tt>Encryption</tt> in their classname.</p></div>
<div class="section">
<h3><a name="Configuring_the_CSV_file_read_tests.2A.2A"></a>Configuring the CSV file read tests</h3>
<p>To test on alternate infrastructures supporting the same APIs, the option <tt>fs.s3a.scale.test.csvfile</tt> must either be set to &quot; &quot; (a space), or an object of at least 10MB must be uploaded to the object store and the <tt>fs.s3a.scale.test.csvfile</tt> option set to its path.</p>
<div>
<div>
<pre class="source">&lt;property&gt;
&lt;name&gt;fs.s3a.scale.test.csvfile&lt;/name&gt;
&lt;value&gt; &lt;/value&gt;
&lt;/property&gt;
</pre></div></div>
<p>(yes, the space is necessary. The Hadoop <tt>Configuration</tt> class treats an empty value as &#x201c;do not override the default&#x201d;).</p></div>
<div class="section">
<h3><a name="Turning_off_S3_Select"></a>Turning off S3 Select</h3>
<p>The S3 select tests are skipped when the S3 endpoint doesn&#x2019;t support S3 Select.</p>
<div>
<div>
<pre class="source">&lt;property&gt;
&lt;name&gt;fs.s3a.select.enabled&lt;/name&gt;
&lt;value&gt;false&lt;/value&gt;
&lt;/property&gt;
</pre></div></div>
<p>If your endpoint doesn&#x2019;t support that feature, this option should be in your <tt>core-site.xml</tt> file, so that trying to use S3 select fails fast with a meaningful error (&#x201c;S3 Select not supported&#x201d;) rather than a generic Bad Request exception.</p></div>
<div class="section">
<h3><a name="Testing_Session_Credentials"></a>Testing Session Credentials</h3>
<p>Some tests request session credentials and assumed role credentials from the AWS Secure Token Service, then use them to authenticate with S3 either directly or via delegation tokens.</p>
<p>If an S3 implementation does not support STS, then these functional test cases must be disabled:</p>
<div>
<div>
<pre class="source">&lt;property&gt;
&lt;name&gt;test.fs.s3a.sts.enabled&lt;/name&gt;
&lt;value&gt;false&lt;/value&gt;
&lt;/property&gt;
</pre></div></div>
<p>These tests request a temporary set of credentials from the STS service endpoint. An alternate endpoint may be defined in <tt>fs.s3a.assumed.role.sts.endpoint</tt>. If this is set, a delegation token region must also be defined: in <tt>fs.s3a.assumed.role.sts.endpoint.region</tt>. This is useful not just for testing alternative infrastructures, but to reduce latency on tests executed away from the central service.</p>
<div>
<div>
<pre class="source">&lt;property&gt;
&lt;name&gt;fs.s3a.delegation.token.endpoint&lt;/name&gt;
&lt;value&gt;fs.s3a.assumed.role.sts.endpoint&lt;/value&gt;
&lt;/property&gt;
&lt;property&gt;
&lt;name&gt;fs.s3a.assumed.role.sts.endpoint.region&lt;/name&gt;
&lt;value&gt;eu-west-2&lt;/value&gt;
&lt;/property&gt;
</pre></div></div>
<p>The default is &quot;&quot;, meaning &#x201c;use the Amazon default endpoint&#x201d; (<tt>sts.amazonaws.com</tt>).</p>
<p>Consult the <a class="externalLink" href="https://docs.aws.amazon.com/general/latest/gr/rande.html#sts_region">AWS documentation</a> for the full list of locations.</p></div></div>
<div class="section">
<h2><a name="Debugging_Test_failures"></a><a name="debugging"></a> Debugging Test failures</h2>
<p>Logging at debug level is the standard way to provide more diagnostics output; after setting this, rerun the tests.</p>
<div>
<div>
<pre class="source">log4j.logger.org.apache.hadoop.fs.s3a=DEBUG
</pre></div></div>
<p>There are also some logging options for debug logging of the AWS client</p>
<div>
<div>
<pre class="source">log4j.logger.com.amazonaws=DEBUG
log4j.logger.com.amazonaws.http.conn.ssl=INFO
log4j.logger.com.amazonaws.internal=INFO
</pre></div></div>
<p>There is also the option of enabling logging on a bucket; this could perhaps be used to diagnose problems from that end. This isn&#x2019;t something actively used, but remains an option. If you are forced to debug this way, consider setting the <tt>fs.s3a.user.agent.prefix</tt> to a unique prefix for a specific test run, which will enable the specific log entries to be more easily located.</p></div>
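<p>For example (the value is illustrative), a per-run prefix could be added to <tt>auth-keys.xml</tt> so that the run&#x2019;s requests stand out in the bucket&#x2019;s server access logs:</p>
<div>
<div>
<pre class="source">&lt;property&gt;
&lt;name&gt;fs.s3a.user.agent.prefix&lt;/name&gt;
&lt;value&gt;my-test-run-2021-06-15&lt;/value&gt;
&lt;/property&gt;
</pre></div></div>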
<div class="section">
<h2><a name="Adding_new_tests"></a><a name="new_tests"></a> Adding new tests</h2>
<p>New tests are always welcome. Bear in mind that we need to keep costs and test time down, which is done by</p>
<ul>
<li>Not duplicating tests.</li>
<li>Being efficient in your use of Hadoop API calls.</li>
<li>Isolating large/slow tests into the &#x201c;scale&#x201d; test group.</li>
<li>Designing all tests to execute in parallel (where possible).</li>
<li>Adding new probes and predicates into existing tests, albeit carefully.</li>
</ul>
<p><i>No duplication</i>: if an operation is tested elsewhere, don&#x2019;t repeat it. This applies as much for metadata operations as it does for bulk IO. If a new test case is added which completely obsoletes an existing test, it is OK to cut the previous one &#x2014;after showing that coverage is not worsened.</p>
<p><i>Efficient</i>: prefer calling <tt>getFileStatus()</tt> and examining the results, rather than calls to <tt>exists()</tt>, <tt>isFile()</tt>, etc.</p>
<p><i>Isolating Scale tests</i>. Any S3A test doing large amounts of IO MUST extend the class <tt>S3AScaleTestBase</tt>, so only running if <tt>scale</tt> is defined on a build, supporting test timeouts configurable by the user. Scale tests should also support configurability as to the actual size of objects/number of operations, so that behavior at different scale can be verified.</p>
<p><i>Designed for parallel execution</i>. A key need here is for each test suite to work on isolated parts of the filesystem. Subclasses of <tt>AbstractS3ATestBase</tt> SHOULD use the <tt>path()</tt> method, with a base path of the test suite name, to build isolated paths. Tests MUST NOT assume that they have exclusive access to a bucket.</p>
<p><i>Extending existing tests where appropriate</i>. This recommendation goes against normal testing best practise of &#x201c;test one thing per method&#x201d;. Because it is so slow to create directory trees or upload large files, we do not have that luxury. All the tests against real S3 endpoints are integration tests where sharing test setup and teardown saves time and money.</p>
<p>A standard way to do this is to extend existing tests with some extra predicates, rather than write new tests. When doing this, make sure that the new predicates fail with meaningful diagnostics, so any new problems can be easily debugged from test logs.</p>
<p><b><i>Effective use of FS instances during S3A integration tests.</i></b> Tests using <tt>FileSystem</tt> instances are fastest if they can recycle the existing FS instance from the same JVM.</p>
<p>If you do that, you MUST NOT close or do unique configuration on them. If you want a guarantee of 100% isolation or an instance with unique config, create a new instance which you MUST close in the teardown to avoid leakage of resources.</p>
<p>Do NOT add <tt>FileSystem</tt> instances manually (with e.g <tt>org.apache.hadoop.fs.FileSystem#addFileSystemForTesting</tt>) to the cache that will be modified or closed during the test runs. This can cause other tests to fail when using the same modified or closed FS instance. For more details see HADOOP-15819.</p></div>
<div class="section">
<h2><a name="Requirements_of_new_Tests"></a><a name="requirements"></a> Requirements of new Tests</h2>
<p>This is what we expect from new tests; they&#x2019;re an extension of the normal Hadoop requirements, based on the need to work with remote servers whose use requires the presence of secret credentials, where tests may be slow, and where finding out why something failed from nothing but the test output is critical.</p>
<div class="section">
<h3><a name="Subclasses_Existing_Shared_Base_Classes"></a>Subclasses Existing Shared Base Classes</h3>
<p>Extend <tt>AbstractS3ATestBase</tt> or <tt>AbstractSTestS3AHugeFiles</tt> unless justifiable. These set things up for testing against the object stores, provide good threadnames, help generate isolated paths, and for <tt>AbstractSTestS3AHugeFiles</tt> subclasses, only run if <tt>-Dscale</tt> is set.</p>
<p>Key features of <tt>AbstractS3ATestBase</tt></p>
<ul>
<li><tt>getFileSystem()</tt> returns the S3A Filesystem bonded to the contract test Filesystem defined in <tt>fs.contract.test.fs.s3a</tt></li>
<li>will automatically skip all tests if that URL is unset.</li>
<li>Extends <tt>AbstractFSContractTestBase</tt> and <tt>Assert</tt> for all their methods.</li>
</ul>
<p>Having shared base classes may help reduce future maintenance too. Please use them.</p></div>
<div class="section">
<h3><a name="Secure"></a>Secure</h3>
<p>Don&#x2019;t ever log credentials. The credential tests go out of their way to not provide meaningful logs or assertion messages precisely to avoid this.</p></div>
<div class="section">
<h3><a name="Efficient_of_Time_and_Money"></a>Efficient of Time and Money</h3>
<p>This means efficient in test setup/teardown, and, ideally, making use of existing public datasets to save setup time and tester cost.</p>
<p>Strategies of particular note are:</p>
<ol style="list-style-type: decimal">
<li><tt>ITestS3ADirectoryPerformance</tt>: a single test case sets up the directory tree then performs different list operations, measuring the time taken.</li>
<li><tt>AbstractSTestS3AHugeFiles</tt>: marks the test suite as <tt>@FixMethodOrder(MethodSorters.NAME_ASCENDING)</tt> then orders the test cases such that each test case expects the previous test to have completed (here: uploaded a file, renamed a file, &#x2026;). This provides for independent tests in the reports, yet still permits an ordered sequence of operations. Do note the use of <tt>Assume.assume()</tt> to detect when the preconditions for a single test case are not met; such tests are then skipped, rather than failing with a stack trace that is really a false alarm.</li>
</ol>
<p>The ordered test case mechanism of <tt>AbstractSTestS3AHugeFiles</tt> is probably the most elegant way of chaining test setup/teardown.</p>
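<p>A condensed sketch of that mechanism (the class name, stage names and operations are illustrative; the real suites do considerably more work per stage): the test cases run in name order, and each one skips itself if the previous stage has not produced its output.</p>
<div>
<div>
<pre class="source">import java.io.IOException;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.contract.ContractTestUtils;
import org.apache.hadoop.fs.s3a.AbstractS3ATestBase;

import org.junit.Assume;
import org.junit.FixMethodOrder;
import org.junit.Test;
import org.junit.runners.MethodSorters;

@FixMethodOrder(MethodSorters.NAME_ASCENDING)
public class ITestExampleOrderedStages extends AbstractS3ATestBase {

  @Test
  public void test_010_create() throws Throwable {
    // first stage: create the file the later stages depend on
    ContractTestUtils.touch(getFileSystem(), path(&quot;stagefile&quot;));
  }

  @Test
  public void test_020_rename() throws Throwable {
    // skip, rather than fail, if the earlier stage did not complete
    Path source = path(&quot;stagefile&quot;);
    Assume.assumeTrue(&quot;no file to rename&quot;,
        getFileSystem().exists(source));
    getFileSystem().rename(source, path(&quot;stagefile-renamed&quot;));
  }
}
</pre></div></div>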
<p>Regarding reusing existing data, we tend to use the landsat archive of AWS US-East for our testing of input stream operations. This doesn&#x2019;t work against other regions, or with third party S3 implementations. Thus the URL can be overridden for testing elsewhere.</p></div>
<div class="section">
<h3><a name="Works_With_Other_S3_Endpoints"></a>Works With Other S3 Endpoints</h3>
<p>Don&#x2019;t assume AWS S3 US-East only; do allow for working with external S3 implementations. Those may lag behind the latest S3 API features, and may not support encryption, session APIs, etc.</p>
<p>They won&#x2019;t have the same CSV test files as some of the input tests rely on. Look at <tt>ITestS3AInputStreamPerformance</tt> to see how tests can be written to support the declaration of a specific large test file on alternate filesystems.</p></div>
<div class="section">
<h3><a name="Works_Over_Long-haul_Links"></a>Works Over Long-haul Links</h3>
<p>As well as making file size and operation counts scalable, this includes making test timeouts adequate. The Scale tests make this configurable; it&#x2019;s hard coded to ten minutes in <tt>AbstractS3ATestBase()</tt>; subclasses can change this by overriding <tt>getTestTimeoutMillis()</tt>.</p>
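<p>A subclass override is a one-liner; in this sketch the twenty-minute value is illustrative, and the <tt>protected int</tt> signature is an assumption based on the method name given above.</p>
<div>
<div>
<pre class="source">@Override
protected int getTestTimeoutMillis() {
  // allow more headroom when testing over a long-haul link
  return 20 * 60 * 1000;
}
</pre></div></div>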
<p>Equally importantly: support proxies, as some testers need them.</p></div>
<div class="section">
<h3><a name="Provides_Diagnostics_and_timing_information"></a>Provides Diagnostics and timing information</h3>
<ol style="list-style-type: decimal">
<li>Give threads useful names.</li>
<li>Create logs, log things. Know that the <tt>S3AFileSystem</tt> and its input and output streams <i>all</i> provide useful statistics in their <tt>toString()</tt> calls; logging them is useful on its own.</li>
<li>You can use <tt>AbstractS3ATestBase.describe(format-string, args)</tt> here; it adds some newlines so the output is easier to spot.</li>
<li>Use <tt>ContractTestUtils.NanoTimer</tt> to measure the duration of operations, and log the output.</li>
</ol></div>
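<p>A fragment pulling these together (the directory, the expected count and the <tt>LOG</tt> field are illustrative; the exact <tt>NanoTimer</tt> method signatures are from memory and may differ slightly between branches):</p>
<div>
<div>
<pre class="source">describe(&quot;listing %d files under %s&quot;, expectedCount, dir);

ContractTestUtils.NanoTimer timer = new ContractTestUtils.NanoTimer();
FileStatus[] listing = getFileSystem().listStatus(dir);
timer.end(&quot;time to list %d entries&quot;, listing.length);

// the filesystem's toString() includes its statistics
LOG.info(&quot;Filesystem statistics: {}&quot;, getFileSystem());
</pre></div></div>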
<div class="section">
<h3><a name="Fails_Meaningfully"></a>Fails Meaningfully</h3>
<p>The <tt>ContractTestUtils</tt> class contains a whole set of assertions for making statements about the expected state of a filesystem, e.g. <tt>assertPathExists(FS, path)</tt>, <tt>assertPathDoesNotExist(FS, path)</tt>, and others. These do their best to provide meaningful diagnostics on failures (e.g. directory listings, file status, &#x2026;), and so help make failures easier to understand.</p>
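<p>A sketch of these assertions in use (the paths are illustrative, and the exact parameter order should be checked against <tt>ContractTestUtils</tt> on your branch):</p>
<div>
<div>
<pre class="source">Path source = path(&quot;source&quot;);
Path dest = path(&quot;dest&quot;);
ContractTestUtils.touch(getFileSystem(), source);
getFileSystem().rename(source, dest);

// failures report listings and file status of the parent directories,
// not just &quot;expected true&quot;
ContractTestUtils.assertPathExists(getFileSystem(),
    &quot;destination missing after rename&quot;, dest);
ContractTestUtils.assertPathDoesNotExist(getFileSystem(),
    &quot;source still present after rename&quot;, source);
</pre></div></div>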
<p>At the very least, do not use <tt>assertTrue()</tt> or <tt>assertFalse()</tt> without including error messages.</p></div>
<div class="section">
<h3><a name="Sets_up_its_filesystem_and_checks_for_those_settings"></a>Sets up its filesystem and checks for those settings</h3>
<p>Tests can override <tt>createConfiguration()</tt> to add new options to the configuration used for the S3A Filesystem instance in their tests.</p>
<p>However, filesystem caching may mean that a test suite gets a cached instance created with a different configuration. For tests which don&#x2019;t need specific configurations, caching is good: it reduces test setup time.</p>
<p>For those tests which do need unique options (encryption, magic files), things can break, and they will do so in hard-to-replicate ways.</p>
<p>Use <tt>S3ATestUtils.disableFilesystemCaching(conf)</tt> to disable caching when modifying the config. As an example from <tt>AbstractTestS3AEncryption</tt>:</p>
<div>
<div>
<pre class="source">@Override
protected Configuration createConfiguration() {
Configuration conf = super.createConfiguration();
S3ATestUtils.disableFilesystemCaching(conf);
conf.set(Constants.SERVER_SIDE_ENCRYPTION_ALGORITHM,
getSSEAlgorithm().getMethod());
return conf;
}
</pre></div></div>
<p>Then verify in the setup method or test cases that their filesystem actually has the desired feature (<tt>fs.getConf().get(...)</tt>). This not only catches filesystem reuse problems, it catches the situation where the filesystem configuration in <tt>auth-keys.xml</tt> has explicit per-bucket settings which override the test suite&#x2019;s general option settings.</p></div>
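<p>Continuing the <tt>AbstractTestS3AEncryption</tt> example above, a setup-time check might look like this (a sketch; the assertion wording is illustrative):</p>
<div>
<div>
<pre class="source">@Override
public void setup() throws Exception {
  super.setup();
  // fail fast if a cached instance or a per-bucket override in auth-keys.xml
  // means the filesystem was not created with the requested algorithm
  assertEquals(&quot;Wrong encryption algorithm on the test filesystem&quot;,
      getSSEAlgorithm().getMethod(),
      getFileSystem().getConf().get(
          Constants.SERVER_SIDE_ENCRYPTION_ALGORITHM));
}
</pre></div></div>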
<div class="section">
<h3><a name="Cleans_Up_Afterwards"></a>Cleans Up Afterwards</h3>
<p>Keeps costs down.</p>
<ol style="list-style-type: decimal">
<li>Do not clean up only when a test case completes successfully; the test suite teardown must always do it.</li>
<li>That teardown code must check for the filesystem and other fields being null before the cleanup. Why? If test setup fails, the teardown methods still get called.</li>
</ol></div>
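<p>A minimal sketch of the null-guarded teardown pattern (the <tt>testData</tt> field is illustrative: a path assigned during <tt>setup()</tt>):</p>
<div>
<div>
<pre class="source">private Path testData;   // assigned in setup()

@Override
public void teardown() throws Exception {
  // teardown runs even if setup() failed part-way, so fields may be null
  if (getFileSystem() != null &amp;&amp; testData != null) {
    getFileSystem().delete(testData, true);   // best-effort cleanup; keeps the bill down
  }
  super.teardown();
}
</pre></div></div>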
<div class="section">
<h3><a name="Works_Reliably"></a>Works Reliably</h3>
<p>We really appreciate this &#x2014; you will too.</p></div>
<div class="section">
<h3><a name="Runs_in_parallel_unless_this_is_unworkable."></a>Runs in parallel unless this is unworkable.</h3>
<p>Tests must be designed to run in parallel with other tests, all working with the same shared S3 bucket. This means</p>
<ul>
<li>Uses relative and JVM-fork-unique paths provided by the method <tt>AbstractFSContractTestBase.path(String filepath)</tt>.</li>
<li>Doesn&#x2019;t manipulate the root directory or make assertions about its contents (for example: delete its contents and assert that it is now empty).</li>
<li>Doesn&#x2019;t have a specific requirement of all active clients of the bucket (example: SSE-C tests which require all files, even directory markers, to be encrypted with the same key).</li>
<li>Doesn&#x2019;t use so much bandwidth that all other tests will be starved of IO and start timing out (e.g. the scale tests).</li>
</ul>
<p>Tests such as these can only be run as sequential tests. When adding one, exclude it from the parallel failsafe run in the POM file and add it to the sequential run afterwards. The IO heavy ones must also be subclasses of <tt>S3AScaleTestBase</tt> and so only run if the system/maven property <tt>fs.s3a.scale.test.enabled</tt> is true.</p></div></div>
<div class="section">
<h2><a name="Individual_test_cases_can_be_run_in_an_IDE"></a>Individual test cases can be run in an IDE</h2>
<p>This is invaluable for debugging test failures.</p>
<p>How to set test options in your hadoop configuration rather than on the maven command line:</p>
<p>As an example, let&#x2019;s assume you want to run the S3Guard integration tests from an IDE. Add the following properties to the <tt>hadoop-tools/hadoop-aws/src/test/resources/auth-keys.xml</tt> file. Local configuration lives in <tt>auth-keys.xml</tt>; changes to this file are never committed, so it is safe to keep local settings there.</p>
<div>
<div>
<pre class="source">&lt;property&gt;
&lt;name&gt;fs.s3a.s3guard.test.enabled&lt;/name&gt;
&lt;value&gt;true&lt;/value&gt;
&lt;/property&gt;
</pre></div></div>
<div>
<div>
<pre class="source">&lt;property&gt;
&lt;name&gt;fs.s3a.s3guard.test.implementation&lt;/name&gt;
&lt;value&gt;dynamo&lt;/value&gt;
&lt;/property&gt;
</pre></div></div>
<p>Warning: although this is easier for IDE debugging setups, once you do this you can no longer change these settings on the <tt>mvn</tt> command line, for example to test without S3Guard.</p>
<div class="section">
<h3><a name="Keeping_AWS_Costs_down"></a>Keeping AWS Costs down</h3>
<p>Most of the base S3 tests are designed to use public AWS data (the landsat-pds bucket) for read IO, so you don&#x2019;t have to pay for bytes downloaded or long term storage costs. The scale tests do work with more data so will cost more as well as generally take more time to execute.</p>
<p>You are however billed for</p>
<ol style="list-style-type: decimal">
<li>Data left in S3 after test runs.</li>
<li>DynamoDB capacity reserved by S3Guard tables.</li>
<li>HTTP operations on files (HEAD, LIST, GET).</li>
<li>In-progress multipart uploads from bulk IO or S3A committer tests.</li>
<li>Encryption/decryption using AWS KMS keys.</li>
</ol>
<p>The GET/decrypt costs are incurred on each partial read of a file, so random IO can cost more than sequential IO; the speedup of queries with columnar data usually justifies this.</p>
<p>The DynamoDB costs come from the number of entries stored and the allocated capacity.</p>
<p>How to keep costs down</p>
<ul>
<li>Don&#x2019;t run the scale tests with large datasets; keep <tt>fs.s3a.scale.test.huge.filesize</tt> unset, or a few MB (minimum: 5).</li>
<li>Remove all files in the filesystem. The root tests usually do this, but it can be manually done:
<p>hadoop fs -rm -r -f -skipTrash <a class="externalLink" href="s3a://test-bucket/">s3a://test-bucket/</a></p></li>
<li>Abort all outstanding uploads:
<p>hadoop s3guard uploads -abort -force <a class="externalLink" href="s3a://test-bucket/">s3a://test-bucket/</a></p></li>
<li>If you don&#x2019;t need it, destroy the S3Guard DDB table.
<p>hadoop s3guard destroy <a class="externalLink" href="s3a://test-bucket/">s3a://test-bucket/</a></p></li>
</ul>
<p>The S3Guard tests will automatically create the DynamoDB table in runs with <tt>-Ds3guard -Ddynamo</tt> set; the default capacity of these test tables is very small. This keeps costs down at the expense of IO performance and, for test runs in or near the S3/DDB stores, the risk of throttling events.</p>
<p>If you want to manage capacity, use <tt>s3guard set-capacity</tt> to increase it (performance) or decrease it (costs). For remote <tt>hadoop-aws</tt> test runs, the read/write capacities of &#x201c;0&#x201d; each should suffice; increase it if parallel test run logs warn of throttling.</p></div></div>
<div class="section">
<h2><a name="Tips"></a><a name="tips"></a> Tips</h2>
<div class="section">
<h3><a name="How_to_keep_your_credentials_really_safe"></a>How to keep your credentials really safe</h3>
<p>Although the <tt>auth-keys.xml</tt> file is marked as ignored in git and subversion, it is still in your source tree, and there&#x2019;s always that risk that it may creep out.</p>
<p>You can avoid this by keeping your keys outside the source tree and using an absolute XInclude reference to it.</p>
<div>
<div>
<pre class="source">&lt;configuration&gt;
&lt;include xmlns=&quot;http://www.w3.org/2001/XInclude&quot;
href=&quot;file:///users/ubuntu/.auth-keys.xml&quot; /&gt;
&lt;/configuration&gt;
</pre></div></div>
</div></div>
<div class="section">
<h2><a name="Failure_Injection"></a><a name="failure-injection"></a>Failure Injection</h2>
<p><b>Warning do not enable any type of failure injection in production. The following settings are for testing only.</b></p>
<p>One of the challenges with S3A integration tests is the fact that S3 was an eventually-consistent storage system. To simulate inconsistencies more frequently than they would normally surface, S3A supports a shim layer on top of the <tt>AmazonS3Client</tt> class which artificially delays certain paths from appearing in listings. This is implemented in the class <tt>InconsistentAmazonS3Client</tt>.</p>
<p>Now that S3 is consistent, injecting failures during integration and functional testing is less important. There&#x2019;s no need to enable it to verify that S3Guard can recover from inconsistencies, given that in production such inconsistencies will never surface.</p></div>
<div class="section">
<h2><a name="Simulating_List_Inconsistencies"></a>Simulating List Inconsistencies</h2>
<div class="section">
<h3><a name="Enabling_the_InconsistentAmazonS3CClient"></a>Enabling the InconsistentAmazonS3CClient</h3>
<p>There are two ways of enabling the <tt>InconsistentAmazonS3Client</tt>: at config-time, or programmatically. For an example of programmatic test usage, see <tt>ITestS3GuardListConsistency</tt>.</p>
<p>To enable the fault-injecting client via configuration, switch the S3A client to use the &#x201c;Inconsistent S3 Client Factory&#x201d; when connecting to S3:</p>
<div>
<div>
<pre class="source">&lt;property&gt;
&lt;name&gt;fs.s3a.s3.client.factory.impl&lt;/name&gt;
&lt;value&gt;org.apache.hadoop.fs.s3a.InconsistentS3ClientFactory&lt;/value&gt;
&lt;/property&gt;
</pre></div></div>
<p>The inconsistent client works by:</p>
<ol style="list-style-type: decimal">
<li>Choosing which objects will be &#x201c;inconsistent&#x201d; at the time the object is created or deleted.</li>
<li>When <tt>listObjects()</tt> is called, any keys that we have marked as inconsistent above will not be returned in the results (until the configured delay has elapsed). Similarly, deleted items may still be <i>returned</i> in results, to delay the visibility of the delete.</li>
</ol>
<p>There are two ways of choosing which keys (filenames) will be affected: By substring, and by random probability.</p>
<div>
<div>
<pre class="source">&lt;property&gt;
&lt;name&gt;fs.s3a.failinject.inconsistency.key.substring&lt;/name&gt;
&lt;value&gt;DELAY_LISTING_ME&lt;/value&gt;
&lt;/property&gt;
&lt;property&gt;
&lt;name&gt;fs.s3a.failinject.inconsistency.probability&lt;/name&gt;
&lt;value&gt;1.0&lt;/value&gt;
&lt;/property&gt;
</pre></div></div>
<p>By default, any object which has the substring &#x201c;DELAY_LISTING_ME&#x201d; in its key will be subject to delayed visibility. For example, the path <tt>s3a://my-bucket/test/DELAY_LISTING_ME/file.txt</tt> would match this condition. To match all keys use the value &#x201c;*&#x201d; (a single asterisk). This is a special value: <i>we don&#x2019;t support arbitrary wildcards.</i></p>
<p>The default probability of delaying an object is 1.0. This means that <i>all</i> keys that match the substring will get delayed visibility. Note that we take the logical <i>and</i> of the two conditions (substring matches <i>and</i> probability random chance occurs). Here are some example configurations:</p>
<div>
<div>
<pre class="source">| substring | probability | behavior |
|-----------|-------------|--------------------------------------------|
| | 0.001 | An empty &lt;value&gt; tag in .xml config will |
| | | be interpreted as unset and revert to the |
| | | default value, &quot;DELAY_LISTING_ME&quot; |
| | | |
| * | 0.001 | 1/1000 chance of *any* key being delayed. |
| | | |
| delay | 0.01 | 1/100 chance of any key containing &quot;delay&quot; |
| | | |
| delay | 1.0 | All keys containing substring &quot;delay&quot; .. |
</pre></div></div>
<p>You can also configure how long you want the delay in visibility to last. The default is 5000 milliseconds (five seconds).</p>
<div>
<div>
<pre class="source">&lt;property&gt;
&lt;name&gt;fs.s3a.failinject.inconsistency.msec&lt;/name&gt;
&lt;value&gt;5000&lt;/value&gt;
&lt;/property&gt;
</pre></div></div>
</div>
<div class="section">
<h3><a name="Limitations_of_Inconsistency_Injection"></a>Limitations of Inconsistency Injection</h3>
<p>Although <tt>InconsistentAmazonS3Client</tt> can delay the visibility of an object or parent directory, it does not prevent the key of that object from appearing in all prefix searches. For example, if we create the following object with the default configuration above, in an otherwise empty bucket:</p>
<div>
<div>
<pre class="source">s3a://bucket/a/b/c/DELAY_LISTING_ME
</pre></div></div>
<p>Then the following paths will still be visible as directories (ignoring possible real-world inconsistencies):</p>
<div>
<div>
<pre class="source">s3a://bucket/a
s3a://bucket/a/b
</pre></div></div>
<p>Whereas <tt>getFileStatus()</tt> on the following <i>will</i> be subject to delayed visibility (<tt>FileNotFoundException</tt> until delay has elapsed):</p>
<div>
<div>
<pre class="source">s3a://bucket/a/b/c
s3a://bucket/a/b/c/DELAY_LISTING_ME
</pre></div></div>
<p>In real-life S3 inconsistency, however, we expect that all the above paths (including <tt>a</tt> and <tt>b</tt>) will be subject to delayed visibility.</p></div>
<div class="section">
<h3><a name="Using_the_InconsistentAmazonS3CClient_in_downstream_integration_tests"></a>Using the <tt>InconsistentAmazonS3CClient</tt> in downstream integration tests</h3>
<p>The inconsistent client is shipped in the <tt>hadoop-aws</tt> JAR, so it can be used in applications which work with S3 to see how they handle inconsistent directory listings.</p></div></div>
<div class="section">
<h2><a name="Testing_S3Guard"></a><a name="s3guard"></a> Testing S3Guard</h2>
<p><a href="./s3guard.html">S3Guard</a> is an extension to S3A which added consistent metadata listings to the S3A client.</p>
<p>It has not been needed for applications to work safely with AWS S3 since the store became consistent in November 2020. However, it is still part of the codebase, and so still needs to be tested.</p>
<p>The basic strategy for testing S3Guard correctness consists of:</p>
<ol style="list-style-type: decimal">
<li>
<p>MetadataStore Contract tests.</p>
<p>The MetadataStore contract tests are inspired by the Hadoop FileSystem and <tt>FileContext</tt> contract tests. Each implementation of the <tt>MetadataStore</tt> interface subclasses the <tt>MetadataStoreTestBase</tt> class and customizes it to initialize their MetadataStore. This test ensures that the different implementations all satisfy the semantics of the MetadataStore API.</p>
</li>
<li>
<p>Running existing S3A unit and integration tests with S3Guard enabled.</p>
<p>You can run the S3A integration tests on top of S3Guard by configuring your <tt>MetadataStore</tt> in your <tt>hadoop-tools/hadoop-aws/src/test/resources/core-site.xml</tt> or <tt>hadoop-tools/hadoop-aws/src/test/resources/auth-keys.xml</tt> files. Next, run the S3A integration tests as outlined in the <i>Running the Tests</i> section of the <a href="./index.html">S3A documentation</a>.</p>
</li>
<li>
<p>Running fault-injection tests that test S3Guard&#x2019;s consistency features.</p>
<p>The <tt>ITestS3GuardListConsistency</tt> uses failure injection to ensure that list consistency logic is correct even when the underlying storage is eventually consistent.</p>
<p>The integration test adds a shim above the Amazon S3 Client layer that injects delays in object visibility.</p>
<p>All of these tests will be run if you follow the steps listed in step 2 above.</p>
<p>No charges are incurred for using this store, and its consistency guarantees are that of the underlying object store instance. <!-- :) --></p>
</li>
</ol>
<div class="section">
<h3><a name="Testing_S3A_with_S3Guard_Enabled"></a>Testing S3A with S3Guard Enabled</h3>
<p>All the S3A tests which work with a private repository can be configured to run with S3Guard by using the <tt>s3guard</tt> profile. When set, this runs all the tests with the local, in-memory metadata store in &#x201c;non-authoritative&#x201d; mode.</p>
<div>
<div>
<pre class="source">mvn -T 1C verify -Dparallel-tests -DtestsThreadCount=6 -Ds3guard
</pre></div></div>
<p>When the <tt>s3guard</tt> profile is enabled, the following profiles can be specified:</p>
<ul>
<li><tt>dynamo</tt>: use an AWS-hosted DynamoDB table, creating the table if it does not exist. You will have to pay the bill for the DynamoDB web service.</li>
<li><tt>auth</tt>: treat the S3Guard metadata as authoritative.</li>
</ul>
<div>
<div>
<pre class="source">mvn -T 1C verify -Dparallel-tests -DtestsThreadCount=6 -Ds3guard -Ddynamo -Dauth
</pre></div></div>
<p>When experimenting with options, it is usually best to run a single test suite at a time until the operations appear to be working.</p>
<div>
<div>
<pre class="source">mvn -T 1C verify -Dtest=skip -Dit.test=ITestS3AMiscOperations -Ds3guard -Ddynamo
</pre></div></div>
</div>
<div class="section">
<h3><a name="Notes"></a>Notes</h3>
<ol style="list-style-type: decimal">
<li>If the <tt>s3guard</tt> profile is not set, then the S3Guard properties are those of the test configuration set in the S3A contract XML file or <tt>auth-keys.xml</tt>.</li>
</ol>
<p>If the <tt>s3guard</tt> profile <i>is</i> set:</p>
<ol style="list-style-type: decimal">
<li>The S3Guard options from maven (the dynamo and authoritative flags) overwrite any previously set in the configuration files.</li>
<li>DynamoDB will be configured to create any missing tables.</li>
<li>When using DynamoDB and running <tt>ITestDynamoDBMetadataStore</tt>, the <tt>fs.s3a.s3guard.ddb.test.table</tt> property MUST be configured, and the name of that table MUST be different than what is used for <tt>fs.s3a.s3guard.ddb.table</tt>. The test table is destroyed and modified multiple times during the test.</li>
<li>Several of the tests create and destroy DynamoDB tables. The table names are prefixed with the value defined by <tt>fs.s3a.s3guard.test.dynamo.table.prefix</tt> (default=&#x201c;s3guard.test.&#x201d;). The user executing the tests will need sufficient privilege to create and destroy such tables. If the tests abort uncleanly, these tables may be left behind, incurring AWS charges.</li>
</ol></div>
<div class="section">
<h3><a name="How_to_Dump_the_Table_and_Metastore_State"></a>How to Dump the Table and Metastore State</h3>
<p>There&#x2019;s an unstable entry point to list the contents of a table and the S3 filesystem to a set of Tab Separated Value files:</p>
<div>
<div>
<pre class="source">hadoop org.apache.hadoop.fs.s3a.s3guard.DumpS3GuardDynamoTable s3a://bucket/ dir/out
</pre></div></div>
<p>This generates a set of files prefixed <tt>dir/out-</tt> with different views of the world which can then be viewed on the command line or editor:</p>
<div>
<div>
<pre class="source">&quot;type&quot; &quot;deleted&quot; &quot;path&quot; &quot;is_auth_dir&quot; &quot;is_empty_dir&quot; &quot;len&quot; &quot;updated&quot; &quot;updated_s&quot; &quot;last_modified&quot; &quot;last_modified_s&quot; &quot;etag&quot; &quot;version&quot;
&quot;file&quot; &quot;true&quot; &quot;s3a://bucket/fork-0001/test/ITestS3AContractDistCp/testDirectWrite/remote&quot; &quot;false&quot; &quot;UNKNOWN&quot; 0 1562171244451 &quot;Wed Jul 03 17:27:24 BST 2019&quot; 1562171244451 &quot;Wed Jul 03 17:27:24 BST 2019&quot; &quot;&quot; &quot;&quot;
&quot;file&quot; &quot;true&quot; &quot;s3a://bucket/Users/stevel/Projects/hadoop-trunk/hadoop-tools/hadoop-aws/target/test-dir/1/5xlPpalRwv/test/new/newdir/file1&quot; &quot;false&quot; &quot;UNKNOWN&quot; 0 1562171518435 &quot;Wed Jul 03 17:31:58 BST 2019&quot; 1562171518435 &quot;Wed Jul 03 17:31:58 BST 2019&quot; &quot;&quot; &quot;&quot;
&quot;file&quot; &quot;true&quot; &quot;s3a://bucket/Users/stevel/Projects/hadoop-trunk/hadoop-tools/hadoop-aws/target/test-dir/1/5xlPpalRwv/test/new/newdir/subdir&quot; &quot;false&quot; &quot;UNKNOWN&quot; 0 1562171518535 &quot;Wed Jul 03 17:31:58 BST 2019&quot; 1562171518535 &quot;Wed Jul 03 17:31:58 BST 2019&quot; &quot;&quot; &quot;&quot;
&quot;file&quot; &quot;true&quot; &quot;s3a://bucket/test/DELAY_LISTING_ME/testMRJob&quot; &quot;false&quot; &quot;UNKNOWN&quot; 0 1562172036299 &quot;Wed Jul 03 17:40:36 BST 2019&quot; 1562172036299 &quot;Wed Jul 03 17:40:36 BST 2019&quot; &quot;&quot; &quot;&quot;
</pre></div></div>
<p>This is unstable: the output format may change without warning. To understand the meaning of the fields, consult the documentation. They are, currently:</p>
<table border="0" class="bodyTable">
<thead>
<tr class="a">
<th> field </th>
<th> meaning </th>
<th> source </th></tr>
</thead><tbody>
<tr class="b">
<td> <tt>type</tt> </td>
<td> type </td>
<td> filestatus </td></tr>
<tr class="a">
<td> <tt>deleted</tt> </td>
<td> tombstone marker </td>
<td> metadata </td></tr>
<tr class="b">
<td> <tt>path</tt> </td>
<td> path of an entry </td>
<td> filestatus </td></tr>
<tr class="a">
<td> <tt>is_auth_dir</tt> </td>
<td> directory entry authoritative status </td>
<td> metadata </td></tr>
<tr class="b">
<td> <tt>is_empty_dir</tt> </td>
<td> does the entry represent an empty directory </td>
<td> metadata </td></tr>
<tr class="a">
<td> <tt>len</tt> </td>
<td> file length </td>
<td> filestatus </td></tr>
<tr class="b">
<td> <tt>last_modified</tt> </td>
<td> file status last modified </td>
<td> filestatus </td></tr>
<tr class="a">
<td> <tt>last_modified_s</tt> </td>
<td> file status last modified as string </td>
<td> filestatus </td></tr>
<tr class="b">
<td> <tt>updated</tt> </td>
<td> time (millis) metadata was updated </td>
<td> metadata </td></tr>
<tr class="a">
<td> <tt>updated_s</tt> </td>
<td> updated time as a string </td>
<td> metadata </td></tr>
<tr class="b">
<td> <tt>etag</tt> </td>
<td> any etag </td>
<td> filestatus </td></tr>
<tr class="a">
<td> <tt>version</tt> </td>
<td> any version</td>
<td> filestatus </td></tr>
</tbody>
</table>
<p>Files generated</p>
<table border="0" class="bodyTable">
<thead>
<tr class="a">
<th> suffix </th>
<th> content </th></tr>
</thead><tbody>
<tr class="b">
<td> <tt>-scan.csv</tt> </td>
<td> Full scan/dump of the metastore </td></tr>
<tr class="a">
<td> <tt>-store.csv</tt> </td>
<td> Recursive walk through the metastore </td></tr>
<tr class="b">
<td> <tt>-tree.csv</tt> </td>
<td> Treewalk through filesystem <tt>listStatus(&quot;/&quot;)</tt> calls </td></tr>
<tr class="a">
<td> <tt>-flat.csv</tt> </td>
<td> Flat listing through filesystem <tt>listFiles(&quot;/&quot;, recursive)</tt> </td></tr>
<tr class="b">
<td> <tt>-s3.csv</tt> </td>
<td> Dump of the S3 Store <i>only</i> </td></tr>
<tr class="a">
<td> <tt>-scan-2.csv</tt> </td>
<td> Scan of the store after the previous operations </td></tr>
</tbody>
</table>
<p>Why the two scan entries? The S3A listing and treewalk operations may add new entries to the metastore/DynamoDB table.</p>
<p>Note 1: this is unstable: the entry list and meaning may change, as may the sorting of output, the listing algorithm, the representation of types, etc. Its expected uses are: diagnostics, support calls and helping us developers work out what we&#x2019;ve just broken.</p>
<p>Note 2: This <i>is</i> safe to use against an active store; the tables may appear to be inconsistent due to changes taking place during the dump sequence.</p></div>
<div class="section">
<h3><a name="Resetting_the_Metastore:_PurgeS3GuardDynamoTable"></a>Resetting the Metastore: <tt>PurgeS3GuardDynamoTable</tt></h3>
<p>The <tt>PurgeS3GuardDynamoTable</tt> entry point <tt>org.apache.hadoop.fs.s3a.s3guard.PurgeS3GuardDynamoTable</tt> can list all entries in a store for a specific filesystem, and delete them. It <i>only</i> deletes those entries in the store for that specific filesystem, even if the store is shared.</p>
<div>
<div>
<pre class="source">hadoop org.apache.hadoop.fs.s3a.s3guard.PurgeS3GuardDynamoTable \
-force s3a://bucket/
</pre></div></div>
<p>Without the <tt>-force</tt> option the table is scanned, but no entries deleted; with it then all entries for that filesystem are deleted. No attempt is made to order the deletion; while the operation is under way the store is not fully connected (i.e. there may be entries whose parent has already been deleted).</p>
<p>Needless to say: <i>it is not safe to use this against a table in active use.</i></p></div>
<div class="section">
<h3><a name="Scale_Testing_MetadataStore_Directly"></a>Scale Testing MetadataStore Directly</h3>
<p>There are some scale tests that exercise Metadata Store implementations directly. These ensure that S3Guard is robust to things like DynamoDB throttling, and compare performance for different implementations. These are included in the scale tests executed when <tt>-Dscale</tt> is passed to the maven command line.</p>
<p>The two S3Guard scale tests are <tt>ITestDynamoDBMetadataStoreScale</tt> and <tt>ITestLocalMetadataStoreScale</tt>.</p>
<p>To run these tests, your DynamoDB table needs to be of limited capacity; the values in <tt>ITestDynamoDBMetadataStoreScale</tt> currently require a read capacity of 10 or less, and a write capacity of 15 or more.</p>
<p>The following settings allow us to run <tt>ITestDynamoDBMetadataStoreScale</tt> with artificially low read and write capacity provisioned, so we can judge the effects of being throttled by the DynamoDB service:</p>
<div>
<div>
<pre class="source">&lt;property&gt;
&lt;name&gt;scale.test.operation.count&lt;/name&gt;
&lt;value&gt;10&lt;/value&gt;
&lt;/property&gt;
&lt;property&gt;
&lt;name&gt;scale.test.directory.count&lt;/name&gt;
&lt;value&gt;3&lt;/value&gt;
&lt;/property&gt;
&lt;property&gt;
&lt;name&gt;fs.s3a.scale.test.enabled&lt;/name&gt;
&lt;value&gt;true&lt;/value&gt;
&lt;/property&gt;
&lt;property&gt;
&lt;name&gt;fs.s3a.s3guard.ddb.table&lt;/name&gt;
&lt;value&gt;my-scale-test&lt;/value&gt;
&lt;/property&gt;
&lt;property&gt;
&lt;name&gt;fs.s3a.s3guard.ddb.table.create&lt;/name&gt;
&lt;value&gt;true&lt;/value&gt;
&lt;/property&gt;
&lt;property&gt;
&lt;name&gt;fs.s3a.s3guard.ddb.table.capacity.read&lt;/name&gt;
&lt;value&gt;5&lt;/value&gt;
&lt;/property&gt;
&lt;property&gt;
&lt;name&gt;fs.s3a.s3guard.ddb.table.capacity.write&lt;/name&gt;
&lt;value&gt;5&lt;/value&gt;
&lt;/property&gt;
</pre></div></div>
<p>These tests verify that the invoked operations can trigger retries in the S3Guard code, rather than just in the AWS SDK level, so showing that if SDK operations fail, they get retried. They also verify that the filesystem statistics are updated to record that throttling took place.</p>
<p><i>Do not panic if these tests fail to detect throttling!</i></p>
<p>These tests are unreliable, as certain conditions need to be met before they repeatedly fail:</p>
<ol style="list-style-type: decimal">
<li>You must have a low-enough latency connection to the DynamoDB store that, for the capacity allocated, you can overload it.</li>
<li>The AWS Console can give you a view of what is happening here.</li>
<li>Running a single test on its own is less likely to trigger an overload than trying to run the whole test suite.</li>
<li>And running the test suite more than once, back-to-back, can also help overload the cluster.</li>
<li>Stepping through with a debugger will reduce load, so may not trigger failures.</li>
</ol>
<p>If the tests fail, it <i>probably</i> just means you aren&#x2019;t putting enough load on the table.</p>
<p>These tests do not verify that the entire set of DynamoDB calls made during the use of an S3Guarded S3A filesystem are wrapped by retry logic.</p>
<p>The best way to verify resilience is to run the entire <tt>hadoop-aws</tt> test suite, or even a real application, with throttling enabled.</p></div>
<div class="section">
<h3><a name="Testing_encrypted_DynamoDB_tables"></a>Testing encrypted DynamoDB tables</h3>
<p>By default, a DynamoDB table is encrypted using an AWS owned customer master key (CMK). You can enable server side encryption (SSE) using an AWS managed CMK or a customer managed CMK in KMS before running the S3Guard tests.</p>
<ol style="list-style-type: decimal">
<li>To enable an AWS managed CMK, set the config <tt>fs.s3a.s3guard.ddb.table.sse.enabled</tt> to true in <tt>auth-keys.xml</tt>.</li>
<li>To enable a customer managed CMK, you need to create a KMS key and set the config in <tt>auth-keys.xml</tt>. The value can be the key ARN or alias. Example:</li>
</ol>
<div>
<div>
<pre class="source"> &lt;property&gt;
&lt;name&gt;fs.s3a.s3guard.ddb.table.sse.enabled&lt;/name&gt;
&lt;value&gt;true&lt;/value&gt;
&lt;/property&gt;
&lt;property&gt;
&lt;name&gt;fs.s3a.s3guard.ddb.table.sse.cmk&lt;/name&gt;
&lt;value&gt;arn:aws:kms:us-west-2:360379543683:key/071a86ff-8881-4ba0-9230-95af6d01ca01&lt;/value&gt;
&lt;/property&gt;
</pre></div></div>
<p>For more details about SSE on DynamoDB table, please see <a href="./s3guard.html">S3Guard doc</a>.</p></div>
<div class="section">
<h3><a name="Testing_only:_Local_Metadata_Store"></a>Testing only: Local Metadata Store</h3>
<p>There is an in-memory Metadata Store for testing.</p>
<div>
<div>
<pre class="source">&lt;property&gt;
&lt;name&gt;fs.s3a.metadatastore.impl&lt;/name&gt;
&lt;value&gt;org.apache.hadoop.fs.s3a.s3guard.LocalMetadataStore&lt;/value&gt;
&lt;/property&gt;
</pre></div></div>
<p>This is not for use in production.</p></div></div>
<div class="section">
<h2><a name="Testing_Assumed_Roles"></a><a name="assumed_roles"></a> Testing Assumed Roles</h2>
<p>Tests for the AWS Assumed Role credential provider require an assumed role to request.</p>
<p>If this role is not declared in <tt>fs.s3a.assumed.role.arn</tt>, the tests which require it will be skipped.</p>
<p>The specific tests an Assumed Role ARN is required for are</p>
<ul>
<li><tt>ITestAssumeRole</tt>.</li>
<li><tt>ITestRoleDelegationTokens</tt>.</li>
<li>One of the parameterized test cases in <tt>ITestDelegatedMRJob</tt>.</li>
</ul>
<p>To run these tests you need:</p>
<ol style="list-style-type: decimal">
<li>A role in your AWS account with full read and write access rights to the S3 bucket used in the tests, to DynamoDB (for S3Guard), and to KMS for any SSE-KMS tests.</li>
</ol>
<p>If your bucket is set up by default to use S3Guard, the role must have access to that service.</p>
<ol style="list-style-type: decimal">
<li>
<p>Your IAM user must have the permissions to &#x201c;assume&#x201d; that role.</p>
</li>
<li>
<p>The role ARN must be set in <tt>fs.s3a.assumed.role.arn</tt>.</p>
</li>
</ol>
<div>
<div>
<pre class="source">&lt;property&gt;
&lt;name&gt;fs.s3a.assumed.role.arn&lt;/name&gt;
&lt;value&gt;arn:aws:iam::9878543210123:role/role-s3-restricted&lt;/value&gt;
&lt;/property&gt;
</pre></div></div>
<p>The tests assume the role with different subsets of permissions and verify that the S3A client (mostly) works when the caller has only write access to part of the directory tree.</p>
<p>You can also run the entire test suite in an assumed role, a more thorough test, by switching to the credentials provider.</p>
<div>
<div>
<pre class="source">&lt;property&gt;
&lt;name&gt;fs.s3a.aws.credentials.provider&lt;/name&gt;
&lt;value&gt;org.apache.hadoop.fs.s3a.auth.AssumedRoleCredentialProvider&lt;/value&gt;
&lt;/property&gt;
</pre></div></div>
<p>The usual credentials needed to log in to the bucket will be used, but now the credentials used to interact with S3 and DynamoDB will be temporary role credentials, rather than the full credentials.</p></div>
<div class="section">
<h2><a name="Qualifying_an_AWS_SDK_Update"></a><a name="qualifiying_sdk_updates"></a> Qualifying an AWS SDK Update</h2>
<p>Updating the AWS SDK is something which does need to be done regularly, but is rarely without complications, major or minor.</p>
<p>Assume that the version of the SDK will remain constant for an X.Y release, excluding security fixes, so it&#x2019;s good to have an update before each release, as long as that update doesn&#x2019;t trigger any regressions.</p>
<ol style="list-style-type: decimal">
<li>Don&#x2019;t make this a last minute action.</li>
<li>The upgrade patch should focus purely on the SDK update, so it can be cherry picked and reverted easily.</li>
<li>Do not mix in an SDK update with any other piece of work, for the same reason.</li>
<li>Plan for an afternoon&#x2019;s work, including before/after testing, log analysis and any manual tests.</li>
<li>Make sure all the integration tests are running (including s3guard, ARN, encryption, scale) <i>before you start the upgrade</i>.</li>
<li>Create a JIRA for updating the SDK. Don&#x2019;t include the version (yet), as it may take a couple of SDK updates before it is ready.</li>
<li>Identify the latest AWS SDK <a class="externalLink" href="https://aws.amazon.com/sdk-for-java/">available for download</a>.</li>
<li>Create a private git branch of trunk for the JIRA, and in <tt>hadoop-project/pom.xml</tt> update the <tt>aws-java-sdk.version</tt> to the new SDK version.</li>
<li>Update AWS SDK versions in NOTICE.txt.</li>
<li>Do a clean build and rerun all the <tt>hadoop-aws</tt> tests, with and without the <tt>-Ds3guard -Ddynamo</tt> options. This includes the <tt>-Pscale</tt> set, with a role defined in <tt>fs.s3a.assumed.role.arn</tt> for the assumed role tests and <tt>fs.s3a.server-side-encryption.key</tt> set for the encryption tests, for full coverage. If you can, scale up the scale tests.</li>
<li>Run the <tt>ILoadTest*</tt> load tests from your IDE or via maven through <tt>mvn verify -Dtest=skip -Dit.test=ILoadTest\*</tt> ; look for regressions in performance as much as failures.</li>
<li>Create the site with <tt>mvn site -DskipTests</tt>; look in <tt>target/site</tt> for the report.</li>
<li>Review <i>every single</i> <tt>-output.txt</tt> file in <tt>hadoop-tools/hadoop-aws/target/failsafe-reports</tt>, paying particular attention to <tt>org.apache.hadoop.fs.s3a.scale.ITestS3AInputStreamPerformance-output.txt</tt>, as that is where changes in stream close/abort logic will surface.</li>
<li>Run <tt>mvn install</tt> to install the artifacts, then in <tt>hadoop-cloud-storage-project/hadoop-cloud-storage</tt> run <tt>mvn dependency:tree -Dverbose &gt; target/dependencies.txt</tt>. Examine the <tt>target/dependencies.txt</tt> file to verify that no new artifacts have unintentionally been declared as dependencies of the shaded <tt>aws-java-sdk-bundle</tt> artifact.</li>
</ol>
<div class="section">
<h3><a name="Basic_command_line_regression_testing"></a>Basic command line regression testing</h3>
<p>We need a run through of the CLI to see if there have been changes there which cause problems, especially whether new log messages have surfaced, or whether some packaging change breaks the CLI.</p>
<p>From the root of the project, create a command line release: <tt>mvn package -Pdist -DskipTests -Dmaven.javadoc.skip=true -DskipShade</tt>.</p>
<ol style="list-style-type: decimal">
<li>Change into the <tt>hadoop-dist/target/hadoop-x.y.z-SNAPSHOT</tt> dir.</li>
<li>Copy a <tt>core-site.xml</tt> file into <tt>etc/hadoop</tt>.</li>
<li>Set the <tt>HADOOP_OPTIONAL_TOOLS</tt> env var on the command line or <tt>~/.hadoop-env</tt>.</li>
</ol>
<div>
<div>
<pre class="source">export HADOOP_OPTIONAL_TOOLS=&quot;hadoop-aws&quot;
</pre></div></div>
<p>Run some basic s3guard commands as well as file operations.</p>
<div>
<div>
<pre class="source">export BUCKETNAME=example-bucket-name
export BUCKET=s3a://$BUCKETNAME
bin/hadoop s3guard bucket-info $BUCKET
bin/hadoop s3guard uploads $BUCKET
# repeat twice, once with &quot;no&quot; and once with &quot;yes&quot; as responses
bin/hadoop s3guard uploads -abort $BUCKET
# ---------------------------------------------------
# assuming s3guard is enabled
# if on pay-by-request, expect an error message and exit code of -1
bin/hadoop s3guard set-capacity $BUCKET
# skip for PAY_PER_REQUEST
bin/hadoop s3guard set-capacity -read 15 -write 15 $BUCKET
bin/hadoop s3guard bucket-info -guarded $BUCKET
bin/hadoop s3guard diff $BUCKET/
bin/hadoop s3guard prune -minutes 10 $BUCKET/
bin/hadoop s3guard import -verbose $BUCKET/
bin/hadoop s3guard authoritative -verbose $BUCKET
# ---------------------------------------------------
# root filesystem operations
# ---------------------------------------------------
bin/hadoop fs -ls $BUCKET/
# assuming file is not yet created, expect error and status code of 1
bin/hadoop fs -ls $BUCKET/file
# exit code of 0 even when path doesn't exist
bin/hadoop fs -rm -R -f $BUCKET/dir-no-trailing
bin/hadoop fs -rm -R -f $BUCKET/dir-trailing/
# error because it is a directory
bin/hadoop fs -rm $BUCKET/
bin/hadoop fs -touchz $BUCKET/file
# expect I/O error as it is the root directory
bin/hadoop fs -rm -r $BUCKET/
# succeeds
bin/hadoop fs -rm -r $BUCKET/\*
# ---------------------------------------------------
# File operations
# ---------------------------------------------------
bin/hadoop fs -mkdir $BUCKET/dir-no-trailing
# used to fail with S3Guard
bin/hadoop fs -mkdir $BUCKET/dir-trailing/
bin/hadoop fs -touchz $BUCKET/file
bin/hadoop fs -ls $BUCKET/
bin/hadoop fs -mv $BUCKET/file $BUCKET/file2
# expect &quot;No such file or directory&quot;
bin/hadoop fs -stat $BUCKET/file
# expect success
bin/hadoop fs -stat $BUCKET/file2
# expect &quot;file exists&quot;
bin/hadoop fs -mkdir $BUCKET/dir-no-trailing
bin/hadoop fs -mv $BUCKET/file2 $BUCKET/dir-no-trailing
bin/hadoop fs -stat $BUCKET/dir-no-trailing/file2
# treated the same as the file stat
bin/hadoop fs -stat $BUCKET/dir-no-trailing/file2/
bin/hadoop fs -ls $BUCKET/dir-no-trailing/file2/
bin/hadoop fs -ls $BUCKET/dir-no-trailing
# expect a &quot;0&quot; here:
bin/hadoop fs -test -d $BUCKET/dir-no-trailing ; echo $?
# expect a &quot;1&quot; here:
bin/hadoop fs -test -d $BUCKET/dir-no-trailing/file2 ; echo $?
# will return NONE unless bucket has checksums enabled
bin/hadoop fs -checksum $BUCKET/dir-no-trailing/file2
# expect &quot;etag&quot; + a long string
bin/hadoop fs -D fs.s3a.etag.checksum.enabled=true -checksum $BUCKET/dir-no-trailing/file2
bin/hadoop fs -expunge -immediate -fs $BUCKET
# ---------------------------------------------------
# Delegation Token support
# ---------------------------------------------------
# failure unless delegation tokens are enabled
bin/hdfs fetchdt --webservice $BUCKET secrets.bin
# success
bin/hdfs fetchdt -D fs.s3a.delegation.token.binding=org.apache.hadoop.fs.s3a.auth.delegation.SessionTokenBinding --webservice $BUCKET secrets.bin
bin/hdfs fetchdt -print secrets.bin
# expect warning &quot;No TokenRenewer defined for token kind S3ADelegationToken/Session&quot;
bin/hdfs fetchdt -renew secrets.bin
# ---------------------------------------------------
# Directory markers
# ---------------------------------------------------
# require success
bin/hadoop s3guard bucket-info -markers aware $BUCKET
# expect failure unless bucket policy is keep
bin/hadoop s3guard bucket-info -markers keep $BUCKET/path
# you may need to set this on a per-bucket basis if you have already been
# playing with options
bin/hadoop s3guard -D fs.s3a.directory.marker.retention=keep bucket-info -markers keep $BUCKET/path
bin/hadoop s3guard -D fs.s3a.bucket.$BUCKETNAME.directory.marker.retention=keep bucket-info -markers keep $BUCKET/path
# expect to see &quot;Directory markers will be kept&quot; messages and status code of &quot;46&quot;
bin/hadoop fs -D fs.s3a.bucket.$BUCKETNAME.directory.marker.retention=keep -mkdir $BUCKET/p1
bin/hadoop fs -D fs.s3a.bucket.$BUCKETNAME.directory.marker.retention=keep -mkdir $BUCKET/p1/p2
bin/hadoop fs -D fs.s3a.bucket.$BUCKETNAME.directory.marker.retention=keep -touchz $BUCKET/p1/p2/file
# expect failure as markers will be found for /p1/ and /p1/p2/
bin/hadoop s3guard markers -audit -verbose $BUCKET
# clean will remove markers
bin/hadoop s3guard markers -clean -verbose $BUCKET
# expect success and exit code of 0
bin/hadoop s3guard markers -audit -verbose $BUCKET
# ---------------------------------------------------
# S3 Select on Landsat
# ---------------------------------------------------
export LANDSATGZ=s3a://landsat-pds/scene_list.gz
bin/hadoop s3guard select -header use -compression gzip $LANDSATGZ \
&quot;SELECT s.entityId,s.cloudCover FROM S3OBJECT s WHERE s.cloudCover &lt; '0.0' LIMIT 100&quot;
</pre></div></div>
</div>
<div class="section">
<h3><a name="Other_tests"></a>Other tests</h3>
<ul>
<li>Whatever applications you have which use S3A: build and run them before the upgrade, then see if they complete successfully in roughly the same time once the upgrade is applied.</li>
<li>Test any third-party endpoints you have access to.</li>
<li>Try different regions (especially a v4 only region), and encryption settings.</li>
<li>Any performance tests you have can identify slowdowns, which can be a sign of changed behavior in the SDK (especially on stream reads and writes).</li>
<li>If you can, try to test in an environment where a proxy is needed to talk to AWS services.</li>
<li>Try and get other people, especially anyone with their own endpoints, apps or different deployment environments, to run their own tests.</li>
<li>Run the load tests, especially <tt>ILoadTestS3ABulkDeleteThrottling</tt>.</li>
</ul></div>
<div class="section">
<h3><a name="Dealing_with_Deprecated_APIs_and_New_Features"></a>Dealing with Deprecated APIs and New Features</h3>
<p>A Jenkins run should tell you if there are new deprecations. If so, you should think about how to deal with them.</p>
<p>Moving to methods and APIs which weren&#x2019;t in the previous SDK release makes it harder to roll back if there is a problem; but there may be good reasons for the deprecation.</p>
<p>At the same time, there may be good reasons for staying with the old code.</p>
<ul>
<li>AWS have embraced the builder pattern for new operations; note that objects constructed this way often have their (existing) setter methods disabled; this may break existing code.</li>
<li>New versions of S3 calls (list v2, bucket existence checks, bulk operations) may be better than the previous HTTP operations &amp; APIs, but they may not work with third-party endpoints, so can only be adopted if made optional, which then adds a new configuration option (with docs, testing, &#x2026;). A change like that must be done in its own patch, with its new tests which compare the old vs new operations.</li>
</ul></div>
<div class="section">
<h3><a name="Committing_the_patch"></a>Committing the patch</h3>
<p>When the patch is committed: update the JIRA to the version number actually used; use that title in the commit message.</p>
<p>Be prepared to roll-back, re-iterate or code your way out of a regression.</p>
<p>There may be some problem which surfaces with wider use, which can get fixed in a new AWS release, by rolling back to an older one, or just worked around (see <a class="externalLink" href="https://issues.apache.org/jira/browse/HADOOP-14596">HADOOP-14596</a>).</p>
<p>Don&#x2019;t be surprised if this happens, don&#x2019;t worry too much, and, while that rollback option is there to be used, ideally try to work forwards.</p>
<p>If the problem is with the SDK, file issues with the <a class="externalLink" href="https://github.com/aws/aws-sdk-java/issues">AWS SDK Bug tracker</a>. If the problem can be fixed or worked around in the Hadoop code, do it there too.</p></div></div>
</div>
</div>
<div class="clear">
<hr/>
</div>
<div id="footer">
<div class="xright">
&#169; 2008-2021
Apache Software Foundation
- <a href="http://maven.apache.org/privacy-policy.html">Privacy Policy</a>.
Apache Maven, Maven, Apache, the Apache feather logo, and the Apache Maven project logos are trademarks of The Apache Software Foundation.
</div>
<div class="clear">
<hr/>
</div>
</div>
</body>
</html>