<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<!--
| Generated by Apache Maven Doxia at 2021-06-15
| Rendered using Apache Maven Stylus Skin 1.5
-->
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<title>Apache Hadoop Azure support &#x2013; Hadoop Azure Support: ABFS — Azure Data Lake Storage Gen2</title>
<style type="text/css" media="all">
@import url("./css/maven-base.css");
@import url("./css/maven-theme.css");
@import url("./css/site.css");
</style>
<link rel="stylesheet" href="./css/print.css" type="text/css" media="print" />
<meta name="Date-Revision-yyyymmdd" content="20210615" />
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />
</head>
<body class="composite">
<div id="banner">
<a href="http://hadoop.apache.org/" id="bannerLeft">
<img src="http://hadoop.apache.org/images/hadoop-logo.jpg" alt="" />
</a>
<a href="http://www.apache.org/" id="bannerRight">
<img src="http://www.apache.org/images/asf_logo_wide.png" alt="" />
</a>
<div class="clear">
<hr/>
</div>
</div>
<div id="breadcrumbs">
<div class="xleft">
<a href="http://www.apache.org/" class="externalLink">Apache</a>
&gt;
<a href="http://hadoop.apache.org/" class="externalLink">Hadoop</a>
&gt;
<a href="index.html">Apache Hadoop Azure support</a>
&gt;
Hadoop Azure Support: ABFS — Azure Data Lake Storage Gen2
</div>
<div class="xright"> <a href="http://wiki.apache.org/hadoop" class="externalLink">Wiki</a>
|
<a href="https://gitbox.apache.org/repos/asf/hadoop.git" class="externalLink">git</a>
&nbsp;| Last Published: 2021-06-15
&nbsp;| Version: 3.3.1
</div>
<div class="clear">
<hr/>
</div>
</div>
<div id="bodyColumn">
<div id="contentBox">
<!---
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License. See accompanying LICENSE file.
-->
<h1>Hadoop Azure Support: ABFS &#x2014; Azure Data Lake Storage Gen2</h1>
<ul>
<li><a href="#Introduction"> Introduction</a></li>
<li><a href="#Features_of_the_ABFS_connector."> Features of the ABFS connector.</a></li>
<li><a href="#Getting_started">Getting started</a>
<ul>
<li><a href="#Concepts">Concepts</a></li></ul></li>
<li><a href="#Hierarchical_Namespaces_.28and_WASB_Compatibility.29"> Hierarchical Namespaces (and WASB Compatibility)</a>
<ul>
<li><a href="#Creating_an_Azure_Storage_Account"> Creating an Azure Storage Account</a>
<ul>
<li><a href="#Creation_through_the_Azure_Portal">Creation through the Azure Portal</a></li></ul></li>
<li><a href="#Creating_a_new_container"> Creating a new container</a></li>
<li><a href="#Listing_and_examining_containers_of_a_Storage_Account.">Listing and examining containers of a Storage Account.</a></li></ul></li>
<li><a href="#Configuring_ABFS"> Configuring ABFS</a></li>
<li><a href="#Authentication"> Authentication</a>
<ul>
<li><a href="#AAD_Token_fetch_retries"> AAD Token fetch retries</a></li>
<li><a href="#Default:_Shared_Key"> Default: Shared Key</a></li>
<li><a href="#OAuth_2.0_Client_Credentials"> OAuth 2.0 Client Credentials</a></li>
<li><a href="#OAuth_2.0:_Username_and_Password"> OAuth 2.0: Username and Password</a></li>
<li><a href="#OAuth_2.0:_Refresh_Token"> OAuth 2.0: Refresh Token</a></li>
<li><a href="#Azure_Managed_Identity"> Azure Managed Identity</a></li>
<li><a href="#Custom_OAuth_2.0_Token_Provider">Custom OAuth 2.0 Token Provider</a></li>
<li><a href="#Delegation_Token_Provider"> Delegation Token Provider</a></li>
<li><a href="#Shared_Access_Signature_.28SAS.29_Token_Provider">Shared Access Signature (SAS) Token Provider</a></li></ul></li>
<li><a href="#Technical_notes"> Technical notes</a>
<ul>
<li><a href="#Proxy_setup"> Proxy setup</a></li>
<li><a href="#Security"> Security</a></li>
<li><a href="#Limitations_of_the_ABFS_connector"> Limitations of the ABFS connector</a></li>
<li><a href="#Consistency_and_Concurrency"> Consistency and Concurrency</a></li>
<li><a href="#Performance_and_Scalability"> Performance and Scalability</a></li>
<li><a href="#Extensibility"> Extensibility</a></li></ul></li>
<li><a href="#Other_configuration_options"> Other configuration options</a>
<ul>
<li><a href="#Flush_Options"> Flush Options</a>
<ul>
<li><a href="#a1._Azure_Blob_File_System_Flush_Options"> 1. Azure Blob File System Flush Options</a></li>
<li><a href="#a2._OutputStream_Flush_Options"> 2. OutputStream Flush Options</a></li></ul></li>
<li><a href="#HNS_Check_Options"> HNS Check Options</a></li>
<li><a href="#Access_Options"> Access Options</a></li>
<li><a href="#Operation_Idempotency"> Operation Idempotency</a></li>
<li><a href="#Primary_User_Group_Options"> Primary User Group Options</a></li>
<li><a href="#IO_Options"> IO Options</a></li>
<li><a href="#Security_Options"> Security Options</a></li>
<li><a href="#Server_Options"> Server Options</a></li>
<li><a href="#Throttling_Options"> Throttling Options</a></li>
<li><a href="#Rename_Options"> Rename Options</a></li>
<li><a href="#Infinite_Lease_Options"> Infinite Lease Options</a></li>
<li><a href="#Perf_Options"> Perf Options</a>
<ul>
<li><a href="#a1._HTTP_Request_Tracking_Options"> 1. HTTP Request Tracking Options</a></li></ul></li></ul></li>
<li><a href="#Troubleshooting"> Troubleshooting</a>
<ul>
<li><a href="#ClassNotFoundException:_org.apache.hadoop.fs.azurebfs.AzureBlobFileSystem">ClassNotFoundException: org.apache.hadoop.fs.azurebfs.AzureBlobFileSystem</a></li>
<li><a href="#ClassNotFoundException:_com.microsoft.azure.storage.StorageErrorCode">ClassNotFoundException: com.microsoft.azure.storage.StorageErrorCode</a></li>
<li><a href="#Server_failed_to_authenticate_the_request">Server failed to authenticate the request</a></li>
<li><a href="#Configuration_property__something_.dfs.core.windows.net_not_found">Configuration property _something_.dfs.core.windows.net not found</a></li>
<li><a href="#No_such_file_or_directory_when_trying_to_list_a_container">No such file or directory when trying to list a container</a></li>
<li><a href="#a.E2.80.9CHTTP_connection_to_https:.2F.2Flogin.microsoftonline.com.2Fsomething_failed_for_getting_token_from_AzureAD._Http_response:_200_OK.E2.80.9D">&#x201c;HTTP connection to https://login.microsoftonline.com/something failed for getting token from AzureAD. Http response: 200 OK&#x201d;</a></li>
<li><a href="#java.io.IOException:_The_ownership_on_the_staging_directory_.2Ftmp.2Fhadoop-yarn.2Fstaging.2Fuser1.2F.staging_is_not_as_expected._It_is_owned_by_.3Cprincipal_id.3E._The_directory_must_be_owned_by_the_submitter_user1_or_user1">java.io.IOException: The ownership on the staging directory /tmp/hadoop-yarn/staging/user1/.staging is not as expected. It is owned by &lt;principal_id&gt;. The directory must be owned by the submitter user1 or user1</a></li></ul></li>
<li><a href="#Testing_ABFS"> Testing ABFS</a></li></ul>
<div class="section">
<h2><a name="Introduction"></a><a name="introduction"></a> Introduction</h2>
<p>The <tt>hadoop-azure</tt> module provides support for the Azure Data Lake Storage Gen2 storage layer through the &#x201c;abfs&#x201d; connector.</p>
<p>To make it part of Apache Hadoop&#x2019;s default classpath, make sure that the <tt>HADOOP_OPTIONAL_TOOLS</tt> environment variable includes <tt>hadoop-azure</tt> in its list <i>on every machine in the cluster</i>:</p>
<div>
<div>
<pre class="source">export HADOOP_OPTIONAL_TOOLS=hadoop-azure
</pre></div></div>
<p>You can set this locally in your <tt>.profile</tt>/<tt>.bashrc</tt>, but note it won&#x2019;t propagate to jobs running in-cluster.</p></div>
<div class="section">
<h2><a name="Features_of_the_ABFS_connector."></a><a name="features"></a> Features of the ABFS connector.</h2>
<ul>
<li>Supports reading and writing data stored in an Azure Blob Storage account.</li>
<li><i>Fully Consistent</i> view of the storage across all clients.</li>
<li>Can read data written through the <tt>wasb:</tt> connector.</li>
<li>Presents a hierarchical file system view by implementing the standard Hadoop <a href="../api/org/apache/hadoop/fs/FileSystem.html"><tt>FileSystem</tt></a> interface.</li>
<li>Supports configuration of multiple Azure Blob Storage accounts.</li>
<li>Can act as a source or destination of data in Hadoop MapReduce, Apache Hive and Apache Spark.</li>
<li>Tested at scale on both Linux and Windows by Microsoft themselves.</li>
<li>Can be used as a replacement for HDFS on Hadoop clusters deployed in Azure infrastructure.</li>
</ul>
<p>For details on ABFS, consult the following documents:</p>
<ul>
<li><a class="externalLink" href="https://azure.microsoft.com/en-gb/blog/a-closer-look-at-azure-data-lake-storage-gen2/">A closer look at Azure Data Lake Storage Gen2</a>; MSDN Article from June 28, 2018.</li>
<li><a class="externalLink" href="https://docs.microsoft.com/en-us/azure/storage/blobs/storage-blob-storage-tiers">Storage Tiers</a></li>
</ul></div>
<div class="section">
<h2><a name="Getting_started"></a>Getting started</h2>
<div class="section">
<h3><a name="Concepts"></a>Concepts</h3>
<p>The Azure Storage data model presents 3 core concepts:</p>
<ul>
<li><b>Storage Account</b>: All access is done through a storage account.</li>
<li><b>Container</b>: A container is a grouping of multiple blobs. A storage account may have multiple containers. In Hadoop, an entire file system hierarchy is stored in a single container.</li>
<li><b>Blob</b>: A file of any type and size, as stored with the existing wasb connector.</li>
</ul>
<p>The ABFS connector connects to classic containers, or those created with Hierarchical Namespaces.</p></div></div>
<div class="section">
<h2><a name="Hierarchical_Namespaces_.28and_WASB_Compatibility.29"></a><a name="namespaces"></a> Hierarchical Namespaces (and WASB Compatibility)</h2>
<p>A key aspect of ADLS Gen 2 is its support for <a class="externalLink" href="https://docs.microsoft.com/en-us/azure/storage/blobs/data-lake-storage-namespace">hierarchical namespaces</a>. These are effectively directories and offer high-performance rename and delete operations, which significantly improves the performance of query engines and tools that write data to the store, including MapReduce, Spark, Hive and DistCp.</p>
<p>This feature is only available if the container was created with &#x201c;namespace&#x201d; support.</p>
<p>You enable namespace support when creating a new Storage Account, by checking the &#x201c;Hierarchical Namespace&#x201d; option in the Portal UI, or, when creating through the command line, using the option <tt>--hierarchical-namespace true</tt>.</p>
<p><i>You cannot enable Hierarchical Namespaces on an existing storage account.</i></p>
<p>Containers in a storage account with Hierarchical Namespaces are not (currently) readable through the <tt>wasb:</tt> connector.</p>
<p>Some of the <tt>az storage</tt> command line commands fail too, for example:</p>
<div>
<div>
<pre class="source">$ az storage container list --account-name abfswales1
Blob API is not yet supported for hierarchical namespace accounts. ErrorCode: BlobApiNotYetSupportedForHierarchicalNamespaceAccounts
</pre></div></div>
<div class="section">
<h3><a name="Creating_an_Azure_Storage_Account"></a><a name="creating"></a> Creating an Azure Storage Account</h3>
<p>The best documentation on getting started with Azure Data Lake Storage Gen2 and the abfs connector is <a class="externalLink" href="https://docs.microsoft.com/en-us/azure/storage/blobs/data-lake-storage-use-hdi-cluster">Using Azure Data Lake Storage Gen2 with Azure HDInsight clusters</a>.</p>
<p>It includes instructions to create an account from <a class="externalLink" href="https://docs.microsoft.com/en-us/cli/azure/install-azure-cli?view=azure-cli-latest">the Azure command line tool</a>, which can be installed on Windows, macOS (via Homebrew) and Linux (apt or yum).</p>
<p>The <a class="externalLink" href="https://docs.microsoft.com/en-us/cli/azure/storage?view=azure-cli-latest">az storage</a> subcommand handles all storage commands; <a class="externalLink" href="https://docs.microsoft.com/en-us/cli/azure/storage/account?view=azure-cli-latest#az-storage-account-create"><tt>az storage account create</tt></a> does the creation.</p>
<p>Until the ADLS Gen2 API support is finalized, you need to add an extension to the Azure CLI.</p>
<div>
<div>
<pre class="source">az extension add --name storage-preview
</pre></div></div>
<p>Check that all is well by verifying that the usage command includes <tt>--hierarchical-namespace</tt>:</p>
<div>
<div>
<pre class="source">$ az storage account
usage: az storage account create [-h] [--verbose] [--debug]
[--output {json,jsonc,table,tsv,yaml,none}]
[--query JMESPATH] --resource-group
RESOURCE_GROUP_NAME --name ACCOUNT_NAME
[--sku {Standard_LRS,Standard_GRS,Standard_RAGRS,Standard_ZRS,Premium_LRS,Premium_ZRS}]
[--location LOCATION]
[--kind {Storage,StorageV2,BlobStorage,FileStorage,BlockBlobStorage}]
[--tags [TAGS [TAGS ...]]]
[--custom-domain CUSTOM_DOMAIN]
[--encryption-services {blob,file,table,queue} [{blob,file,table,queue} ...]]
[--access-tier {Hot,Cool}]
[--https-only [{true,false}]]
[--file-aad [{true,false}]]
[--hierarchical-namespace [{true,false}]]
[--bypass {None,Logging,Metrics,AzureServices} [{None,Logging,Metrics,AzureServices} ...]]
[--default-action {Allow,Deny}]
[--assign-identity]
[--subscription _SUBSCRIPTION]
</pre></div></div>
<p>You can list locations from <tt>az account list-locations</tt>, which lists the name to refer to in the <tt>--location</tt> argument:</p>
<div>
<div>
<pre class="source">$ az account list-locations -o table
DisplayName Latitude Longitude Name
------------------- ---------- ----------- ------------------
East Asia 22.267 114.188 eastasia
Southeast Asia 1.283 103.833 southeastasia
Central US 41.5908 -93.6208 centralus
East US 37.3719 -79.8164 eastus
East US 2 36.6681 -78.3889 eastus2
West US 37.783 -122.417 westus
North Central US 41.8819 -87.6278 northcentralus
South Central US 29.4167 -98.5 southcentralus
North Europe 53.3478 -6.2597 northeurope
West Europe 52.3667 4.9 westeurope
Japan West 34.6939 135.5022 japanwest
Japan East 35.68 139.77 japaneast
Brazil South -23.55 -46.633 brazilsouth
Australia East -33.86 151.2094 australiaeast
Australia Southeast -37.8136 144.9631 australiasoutheast
South India 12.9822 80.1636 southindia
Central India 18.5822 73.9197 centralindia
West India 19.088 72.868 westindia
Canada Central 43.653 -79.383 canadacentral
Canada East 46.817 -71.217 canadaeast
UK South 50.941 -0.799 uksouth
UK West 53.427 -3.084 ukwest
West Central US 40.890 -110.234 westcentralus
West US 2 47.233 -119.852 westus2
Korea Central 37.5665 126.9780 koreacentral
Korea South 35.1796 129.0756 koreasouth
France Central 46.3772 2.3730 francecentral
France South 43.8345 2.1972 francesouth
Australia Central -35.3075 149.1244 australiacentral
Australia Central 2 -35.3075 149.1244 australiacentral2
</pre></div></div>
<p>Once a location has been chosen, create the account:</p>
<div>
<div>
<pre class="source">az storage account create --verbose \
--name abfswales1 \
--resource-group devteam2 \
--kind StorageV2 \
--hierarchical-namespace true \
--location ukwest \
--sku Standard_LRS \
--https-only true \
--encryption-services blob \
--access-tier Hot \
--tags owner=engineering \
--assign-identity \
--output jsonc
</pre></div></div>
<p>The output of the command is a JSON document, whose <tt>primaryEndpoints</tt> attribute includes the name of the store endpoint:</p>
<div>
<div>
<pre class="source">{
&quot;primaryEndpoints&quot;: {
&quot;blob&quot;: &quot;https://abfswales1.blob.core.windows.net/&quot;,
&quot;dfs&quot;: &quot;https://abfswales1.dfs.core.windows.net/&quot;,
&quot;file&quot;: &quot;https://abfswales1.file.core.windows.net/&quot;,
&quot;queue&quot;: &quot;https://abfswales1.queue.core.windows.net/&quot;,
&quot;table&quot;: &quot;https://abfswales1.table.core.windows.net/&quot;,
&quot;web&quot;: &quot;https://abfswales1.z35.web.core.windows.net/&quot;
}
}
</pre></div></div>
<p>The <tt>abfswales1.dfs.core.windows.net</tt> endpoint is the name by which the storage account will be referred to.</p>
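<p>When scripting, the endpoint can be read directly rather than picked out of the full JSON. The following is a minimal sketch, assuming the account and resource group names used in the creation example above; it relies only on the global <tt>--query</tt> (JMESPath) and <tt>--output</tt> options of the Azure CLI.</p>
<div>
<div>
<pre class="source"># Print only the "dfs" entry of primaryEndpoints as plain text.
# abfswales1 and devteam2 are the example names used earlier in this page.
az storage account show \
  --name abfswales1 \
  --resource-group devteam2 \
  --query primaryEndpoints.dfs \
  --output tsv
</pre></div></div>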
<p>Now ask for the connection string to the store, which contains the account key</p>
<div>
<div>
<pre class="source">az storage account show-connection-string --name abfswales1
{
&quot;connectionString&quot;: &quot;DefaultEndpointsProtocol=https;EndpointSuffix=core.windows.net;AccountName=abfswales1;AccountKey=ZGlkIHlvdSByZWFsbHkgdGhpbmsgSSB3YXMgZ29pbmcgdG8gcHV0IGEga2V5IGluIGhlcmU/IA==&quot;
}
</pre></div></div>
<p>You then need to add the access key to your <tt>core-site.xml</tt> or JCEKS file, or use your cluster management tool to set the option <tt>fs.azure.account.key.STORAGE-ACCOUNT</tt> to this value.</p>
<div>
<div>
<pre class="source">&lt;property&gt;
&lt;name&gt;fs.azure.account.key.abfswales1.dfs.core.windows.net&lt;/name&gt;
&lt;value&gt;ZGlkIHlvdSByZWFsbHkgdGhpbmsgSSB3YXMgZ29pbmcgdG8gcHV0IGEga2V5IGluIGhlcmU/IA==&lt;/value&gt;
&lt;/property&gt;
</pre></div></div>
<div class="section">
<h4><a name="Creation_through_the_Azure_Portal"></a>Creation through the Azure Portal</h4>
<p>Creation through the portal is covered in <a class="externalLink" href="https://docs.microsoft.com/en-us/azure/storage/blobs/data-lake-storage-quickstart-create-account">Quickstart: Create an Azure Data Lake Storage Gen2 storage account</a></p>
<p>Key Steps</p>
<ol style="list-style-type: decimal">
<li>Create a new Storage Account in a location which suits you.</li>
<li>&#x201c;Basics&#x201d; Tab: select &#x201c;StorageV2&#x201d;.</li>
<li>&#x201c;Advanced&#x201d; Tab: enable &#x201c;Hierarchical Namespace&#x201d;.</li>
</ol>
<p>You have now created your storage account. Next, get the key for authentication for using the default &#x201c;Shared Key&#x201d; authentication.</p>
<ol style="list-style-type: decimal">
<li>Go to the Azure Portal.</li>
<li>Select &#x201c;Storage Accounts&#x201d;</li>
<li>Select the newly created storage account.</li>
<li>In the list of settings, locate &#x201c;Access Keys&#x201d; and select that.</li>
<li>Copy one of the access keys to the clipboard, then add it to the XML option, or set it in your cluster management tools, Hadoop JCEKS file or KMS store.</li>
</ol></div></div>
<div class="section">
<h3><a name="Creating_a_new_container"></a><a name="new_container"></a> Creating a new container</h3>
<p>An Azure storage account can have multiple containers, each with the container name as the userinfo field of the URI used to reference it.</p>
<p>For example, the container &#x201c;container1&#x201d; in the storage account just created will have the URL <tt>abfs://container1@abfswales1.dfs.core.windows.net/</tt></p>
<p>You can create a new container through the ABFS connector by setting the option <tt>fs.azure.createRemoteFileSystemDuringInitialization</tt> to <tt>true</tt>. This is not supported when the authentication type is SAS.</p>
<p>If the container does not exist, an attempt to list it with <tt>hadoop fs -ls</tt> will fail:</p>
<div>
<div>
<pre class="source">$ hadoop fs -ls abfs://container1@abfswales1.dfs.core.windows.net/
ls: `abfs://container1@abfswales1.dfs.core.windows.net/': No such file or directory
</pre></div></div>
<p>Enable remote FS creation and the second attempt succeeds, creating the container as it does so:</p>
<div>
<div>
<pre class="source">$ hadoop fs -D fs.azure.createRemoteFileSystemDuringInitialization=true \
-ls abfs://container1@abfswales1.dfs.core.windows.net/
</pre></div></div>
<p>This is useful for creating containers on the command line, especially before the <tt>az storage</tt> command supports hierarchical namespaces completely.</p></div>
<div class="section">
<h3><a name="Listing_and_examining_containers_of_a_Storage_Account."></a>Listing and examining containers of a Storage Account.</h3>
<p>You can use the <a class="externalLink" href="https://azure.microsoft.com/en-us/features/storage-explorer/">Azure Storage Explorer</a> to list and examine the containers of a storage account.</p></div></div>
<div class="section">
<h2><a name="Configuring_ABFS"></a><a name="configuring"></a> Configuring ABFS</h2>
<p>Any configuration can be specified generally (or as the default when accessing all accounts) or can be tied to a specific account. For example, an OAuth identity can be configured for use regardless of which account is accessed with the property <tt>fs.azure.account.oauth2.client.id</tt> or you can configure an identity to be used only for a specific storage account with <tt>fs.azure.account.oauth2.client.id.&lt;account_name&gt;.dfs.core.windows.net</tt>.</p>
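<p>As a minimal sketch of the two forms, the first property below applies to every account and the second only to the <tt>abfswales1</tt> account used earlier in this document. Both client ID values are placeholders, and it is assumed here that the account-specific value takes precedence for that account.</p>
<div>
<div>
<pre class="source">&lt;!-- Default for all storage accounts (placeholder value) --&gt;
&lt;property&gt;
&lt;name&gt;fs.azure.account.oauth2.client.id&lt;/name&gt;
&lt;value&gt;DEFAULT-CLIENT-ID&lt;/value&gt;
&lt;/property&gt;
&lt;!-- Used only for abfswales1 (placeholder value) --&gt;
&lt;property&gt;
&lt;name&gt;fs.azure.account.oauth2.client.id.abfswales1.dfs.core.windows.net&lt;/name&gt;
&lt;value&gt;ABFSWALES1-CLIENT-ID&lt;/value&gt;
&lt;/property&gt;
</pre></div></div>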
<p>Examples of these properties appear in the Authentication section.</p></div>
<div class="section">
<h2><a name="Authentication"></a><a name="authentication"></a> Authentication</h2>
<p>Authentication for ABFS is ultimately granted by <a class="externalLink" href="https://docs.microsoft.com/en-us/azure/active-directory/develop/authentication-scenarios">Azure Active Directory</a>.</p>
<p>The concepts covered there are beyond the scope of this document; developers are expected to have read and understood them in order to take advantage of the different authentication mechanisms.</p>
<p>What is covered here, briefly, is how to configure the ABFS client to authenticate in different deployment situations.</p>
<p>The ABFS client can be deployed in different ways, with its authentication needs driven by them.</p>
<ol style="list-style-type: decimal">
<li>With the storage account&#x2019;s authentication secret in the configuration: &#x201c;Shared Key&#x201d;.</li>
<li>Using OAuth 2.0 tokens of one form or another.</li>
<li>Deployed in-Azure with the Azure VMs providing OAuth 2.0 tokens to the application, &#x201c;Managed Instance&#x201d;.</li>
<li>Using Shared Access Signature (SAS) tokens provided by a custom implementation of the SASTokenProvider interface.</li>
</ol>
<p>What can be changed is what secrets/credentials are used to authenticate the caller.</p>
<p>The authentication mechanism is set in <tt>fs.azure.account.auth.type</tt> (or the account-specific variant). The possible values are SharedKey, OAuth, Custom and SAS. For the various OAuth options, set the provider in <tt>fs.azure.account.oauth.provider.type</tt>. The supported implementations are ClientCredsTokenProvider, UserPasswordTokenProvider, MsiTokenProvider and RefreshTokenBasedTokenProvider. An IllegalArgumentException is thrown if the specified provider type is not one of these.</p>
<p>All secrets can be stored in JCEKS files. These are encrypted and password protected; use them or a compatible Hadoop Key Management Store wherever possible.</p>
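<p>One way to make such a file visible to the connector is the Hadoop credential provider path. The fragment below is a sketch only: the JCEKS location is a hypothetical path, and it assumes the account key has already been stored in that file under the alias <tt>fs.azure.account.key.abfswales1.dfs.core.windows.net</tt>.</p>
<div>
<div>
<pre class="source">&lt;property&gt;
&lt;name&gt;hadoop.security.credential.provider.path&lt;/name&gt;
&lt;!-- hypothetical location of the JCEKS file --&gt;
&lt;value&gt;jceks://hdfs@namenode.example.com/security/abfs.jceks&lt;/value&gt;
&lt;/property&gt;
</pre></div></div>
<p>With this in place, the key itself no longer needs to appear in <tt>core-site.xml</tt>.</p>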
<div class="section">
<h3><a name="AAD_Token_fetch_retries"></a><a name="aad-token-fetch-retry-logic"></a> AAD Token fetch retries</h3>
<p>The exponential retry policy used for the AAD token fetch retries can be tuned with the following configurations.</p>
<ul>
<li><tt>fs.azure.oauth.token.fetch.retry.max.retries</tt>: Sets the maximum number of retries. The default value is 5.</li>
<li><tt>fs.azure.oauth.token.fetch.retry.min.backoff.interval</tt>: Minimum back-off interval, in milliseconds. It is added to the retry interval computed from the delta back-off. The default value is 0.</li>
<li><tt>fs.azure.oauth.token.fetch.retry.max.backoff.interval</tt>: Maximum back-off interval, in milliseconds. The default value is 60000 (sixty seconds).</li>
<li><tt>fs.azure.oauth.token.fetch.retry.delta.backoff</tt>: Back-off interval between retries. Multiples of this timespan are used for subsequent retry attempts. The default value is 2.</li>
</ul></div>
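<p>The four retry settings can be written out as follows. This is an illustrative fragment that simply restates the default values quoted above; adjust them only if token fetches are seen to fail or to back off for too long in your environment.</p>
<div>
<div>
<pre class="source">&lt;property&gt;
&lt;name&gt;fs.azure.oauth.token.fetch.retry.max.retries&lt;/name&gt;
&lt;value&gt;5&lt;/value&gt;
&lt;/property&gt;
&lt;property&gt;
&lt;name&gt;fs.azure.oauth.token.fetch.retry.min.backoff.interval&lt;/name&gt;
&lt;value&gt;0&lt;/value&gt;
&lt;/property&gt;
&lt;property&gt;
&lt;name&gt;fs.azure.oauth.token.fetch.retry.max.backoff.interval&lt;/name&gt;
&lt;value&gt;60000&lt;/value&gt;
&lt;/property&gt;
&lt;property&gt;
&lt;name&gt;fs.azure.oauth.token.fetch.retry.delta.backoff&lt;/name&gt;
&lt;value&gt;2&lt;/value&gt;
&lt;/property&gt;
</pre></div></div>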
<div class="section">
<h3><a name="Default:_Shared_Key"></a><a name="shared-key-auth"></a> Default: Shared Key</h3>
<p>This is the simplest authentication mechanism of account + password.</p>
<p>The account name is inferred from the URL; the password, &#x201c;key&#x201d;, is retrieved from the XML/JCEKS configuration files.</p>
<div>
<div>
<pre class="source">&lt;property&gt;
&lt;name&gt;fs.azure.account.auth.type.abfswales1.dfs.core.windows.net&lt;/name&gt;
&lt;value&gt;SharedKey&lt;/value&gt;
&lt;description&gt;
&lt;/description&gt;
&lt;/property&gt;
&lt;property&gt;
&lt;name&gt;fs.azure.account.key.abfswales1.dfs.core.windows.net&lt;/name&gt;
&lt;value&gt;ZGlkIHlvdSByZWFsbHkgdGhpbmsgSSB3YXMgZ29pbmcgdG8gcHV0IGEga2V5IGluIGhlcmU/IA==&lt;/value&gt;
&lt;description&gt;
The secret password. Never share these.
&lt;/description&gt;
&lt;/property&gt;
</pre></div></div>
<p><i>Note</i>: The source of the account key can be changed through a custom key provider; one exists to execute a shell script to retrieve it.</p>
<p>A custom key provider class can be specified with the config <tt>fs.azure.account.keyprovider</tt>. If a key provider class is specified, it is used to get the account key. Otherwise the simple key provider is used, which reads the key from the config <tt>fs.azure.account.key</tt>.</p>
<p>To retrieve the key using a shell script, specify the path to the script in the config <tt>fs.azure.shellkeyprovider.script</tt>. The ShellDecryptionKeyProvider class uses the specified script to retrieve the key.</p></div>
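<p>A sketch of a shell-script key provider setup follows. The script path is hypothetical, and the package name given for <tt>ShellDecryptionKeyProvider</tt> is an assumption that should be checked against the hadoop-azure source for your release.</p>
<div>
<div>
<pre class="source">&lt;property&gt;
&lt;name&gt;fs.azure.account.keyprovider&lt;/name&gt;
&lt;!-- assumed fully qualified class name --&gt;
&lt;value&gt;org.apache.hadoop.fs.azurebfs.services.ShellDecryptionKeyProvider&lt;/value&gt;
&lt;/property&gt;
&lt;property&gt;
&lt;name&gt;fs.azure.shellkeyprovider.script&lt;/name&gt;
&lt;!-- hypothetical script location --&gt;
&lt;value&gt;/opt/hadoop/bin/get-abfs-key.sh&lt;/value&gt;
&lt;/property&gt;
</pre></div></div>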
<div class="section">
<h3><a name="OAuth_2.0_Client_Credentials"></a><a name="oauth-client-credentials"></a> OAuth 2.0 Client Credentials</h3>
<p>OAuth 2.0 credentials of (client id, client secret, endpoint) are provided in the configuration/JCEKS file.</p>
<p>The specifics of this process are covered in <a href="../hadoop-azure-datalake/index.html#Configuring_Credentials_and_FileSystem">hadoop-azure-datalake</a>; the key names are slightly different here.</p>
<div>
<div>
<pre class="source">&lt;property&gt;
&lt;name&gt;fs.azure.account.auth.type&lt;/name&gt;
&lt;value&gt;OAuth&lt;/value&gt;
&lt;description&gt;
Use OAuth authentication
&lt;/description&gt;
&lt;/property&gt;
&lt;property&gt;
&lt;name&gt;fs.azure.account.oauth.provider.type&lt;/name&gt;
&lt;value&gt;org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider&lt;/value&gt;
&lt;description&gt;
Use client credentials
&lt;/description&gt;
&lt;/property&gt;
&lt;property&gt;
&lt;name&gt;fs.azure.account.oauth2.client.endpoint&lt;/name&gt;
&lt;value&gt;&lt;/value&gt;
&lt;description&gt;
URL of OAuth endpoint
&lt;/description&gt;
&lt;/property&gt;
&lt;property&gt;
&lt;name&gt;fs.azure.account.oauth2.client.id&lt;/name&gt;
&lt;value&gt;&lt;/value&gt;
&lt;description&gt;
Client ID
&lt;/description&gt;
&lt;/property&gt;
&lt;property&gt;
&lt;name&gt;fs.azure.account.oauth2.client.secret&lt;/name&gt;
&lt;value&gt;&lt;/value&gt;
&lt;description&gt;
Secret
&lt;/description&gt;
&lt;/property&gt;
</pre></div></div>
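<p>The same properties can also be tied to a single storage account by appending the account name to each key. The fragment below is a sketch for the <tt>abfswales1</tt> account: the tenant ID and client ID are placeholders, the endpoint follows the usual Azure Active Directory token URL pattern, and the client secret is assumed to be held in a JCEKS file rather than in this file.</p>
<div>
<div>
<pre class="source">&lt;property&gt;
&lt;name&gt;fs.azure.account.auth.type.abfswales1.dfs.core.windows.net&lt;/name&gt;
&lt;value&gt;OAuth&lt;/value&gt;
&lt;/property&gt;
&lt;property&gt;
&lt;name&gt;fs.azure.account.oauth.provider.type.abfswales1.dfs.core.windows.net&lt;/name&gt;
&lt;value&gt;org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider&lt;/value&gt;
&lt;/property&gt;
&lt;property&gt;
&lt;name&gt;fs.azure.account.oauth2.client.endpoint.abfswales1.dfs.core.windows.net&lt;/name&gt;
&lt;!-- TENANT-ID is a placeholder --&gt;
&lt;value&gt;https://login.microsoftonline.com/TENANT-ID/oauth2/token&lt;/value&gt;
&lt;/property&gt;
&lt;property&gt;
&lt;name&gt;fs.azure.account.oauth2.client.id.abfswales1.dfs.core.windows.net&lt;/name&gt;
&lt;!-- CLIENT-ID is a placeholder --&gt;
&lt;value&gt;CLIENT-ID&lt;/value&gt;
&lt;/property&gt;
</pre></div></div>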
</div>
<div class="section">
<h3><a name="OAuth_2.0:_Username_and_Password"></a><a name="oauth-user-and-passwd"></a> OAuth 2.0: Username and Password</h3>
<p>An OAuth 2.0 endpoint, username and password are provided in the configuration/JCEKS file.</p>
<div>
<div>
<pre class="source">&lt;property&gt;
&lt;name&gt;fs.azure.account.auth.type&lt;/name&gt;
&lt;value&gt;OAuth&lt;/value&gt;
&lt;description&gt;
Use OAuth authentication
&lt;/description&gt;
&lt;/property&gt;
&lt;property&gt;
&lt;name&gt;fs.azure.account.oauth.provider.type&lt;/name&gt;
&lt;value&gt;org.apache.hadoop.fs.azurebfs.oauth2.UserPasswordTokenProvider&lt;/value&gt;
&lt;description&gt;
Use user and password
&lt;/description&gt;
&lt;/property&gt;
&lt;property&gt;
&lt;name&gt;fs.azure.account.oauth2.client.endpoint&lt;/name&gt;
&lt;value&gt;&lt;/value&gt;
&lt;description&gt;
URL of OAuth 2.0 endpoint
&lt;/description&gt;
&lt;/property&gt;
&lt;property&gt;
&lt;name&gt;fs.azure.account.oauth2.user.name&lt;/name&gt;
&lt;value&gt;&lt;/value&gt;
&lt;description&gt;
username
&lt;/description&gt;
&lt;/property&gt;
&lt;property&gt;
&lt;name&gt;fs.azure.account.oauth2.user.password&lt;/name&gt;
&lt;value&gt;&lt;/value&gt;
&lt;description&gt;
password for account
&lt;/description&gt;
&lt;/property&gt;
</pre></div></div>
</div>
<div class="section">
<h3><a name="OAuth_2.0:_Refresh_Token"></a><a name="oauth-refresh-token"></a> OAuth 2.0: Refresh Token</h3>
<p>With an existing OAuth 2.0 token, make a request of the Active Directory endpoint <tt>https://login.microsoftonline.com/Common/oauth2/token</tt> for this token to be refreshed.</p>
<div>
<div>
<pre class="source">&lt;property&gt;
&lt;name&gt;fs.azure.account.auth.type&lt;/name&gt;
&lt;value&gt;OAuth&lt;/value&gt;
&lt;description&gt;
Use OAuth 2.0 authentication
&lt;/description&gt;
&lt;/property&gt;
&lt;property&gt;
&lt;name&gt;fs.azure.account.oauth.provider.type&lt;/name&gt;
&lt;value&gt;org.apache.hadoop.fs.azurebfs.oauth2.RefreshTokenBasedTokenProvider&lt;/value&gt;
&lt;description&gt;
Use the Refresh Token Provider
&lt;/description&gt;
&lt;/property&gt;
&lt;property&gt;
&lt;name&gt;fs.azure.account.oauth2.refresh.token&lt;/name&gt;
&lt;value&gt;&lt;/value&gt;
&lt;description&gt;
Refresh token
&lt;/description&gt;
&lt;/property&gt;
&lt;property&gt;
&lt;name&gt;fs.azure.account.oauth2.refresh.endpoint&lt;/name&gt;
&lt;value&gt;&lt;/value&gt;
&lt;description&gt;
Refresh token endpoint
&lt;/description&gt;
&lt;/property&gt;
&lt;property&gt;
&lt;name&gt;fs.azure.account.oauth2.client.id&lt;/name&gt;
&lt;value&gt;&lt;/value&gt;
&lt;description&gt;
Optional Client ID
&lt;/description&gt;
&lt;/property&gt;
</pre></div></div>
</div>
<div class="section">
<h3><a name="Azure_Managed_Identity"></a><a name="managed-identity"></a> Azure Managed Identity</h3>
<p><a class="externalLink" href="https://docs.microsoft.com/en-us/azure/active-directory/managed-identities-azure-resources/overview">Azure Managed Identities</a>, formerly &#x201c;Managed Service Identities&#x201d;.</p>
<p>OAuth 2.0 tokens are issued by a special endpoint only accessible from the executing VM (<tt>http://169.254.169.254/metadata/identity/oauth2/token</tt>). The issued credentials can be used to authenticate.</p>
<p>The Azure Portal/CLI is used to create the service identity.</p>
<div>
<div>
<pre class="source">&lt;property&gt;
&lt;name&gt;fs.azure.account.auth.type&lt;/name&gt;
&lt;value&gt;OAuth&lt;/value&gt;
&lt;description&gt;
Use OAuth authentication
&lt;/description&gt;
&lt;/property&gt;
&lt;property&gt;
&lt;name&gt;fs.azure.account.oauth.provider.type&lt;/name&gt;
&lt;value&gt;org.apache.hadoop.fs.azurebfs.oauth2.MsiTokenProvider&lt;/value&gt;
&lt;description&gt;
Use MSI for issuing OAuth tokens
&lt;/description&gt;
&lt;/property&gt;
&lt;property&gt;
&lt;name&gt;fs.azure.account.oauth2.msi.tenant&lt;/name&gt;
&lt;value&gt;&lt;/value&gt;
&lt;description&gt;
Optional MSI Tenant ID
&lt;/description&gt;
&lt;/property&gt;
&lt;property&gt;
&lt;name&gt;fs.azure.account.oauth2.msi.endpoint&lt;/name&gt;
&lt;value&gt;&lt;/value&gt;
&lt;description&gt;
MSI endpoint
&lt;/description&gt;
&lt;/property&gt;
&lt;property&gt;
&lt;name&gt;fs.azure.account.oauth2.client.id&lt;/name&gt;
&lt;value&gt;&lt;/value&gt;
&lt;description&gt;
Optional Client ID
&lt;/description&gt;
&lt;/property&gt;
</pre></div></div>
</div>
<div class="section">
<h3><a name="Custom_OAuth_2.0_Token_Provider"></a>Custom OAuth 2.0 Token Provider</h3>
<p>A Custom OAuth 2.0 token provider supplies the ABFS connector with an OAuth 2.0 token when its <tt>getAccessToken()</tt> method is invoked.</p>
<div>
<div>
<pre class="source">&lt;property&gt;
&lt;name&gt;fs.azure.account.auth.type&lt;/name&gt;
&lt;value&gt;Custom&lt;/value&gt;
&lt;description&gt;
Custom Authentication
&lt;/description&gt;
&lt;/property&gt;
&lt;property&gt;
&lt;name&gt;fs.azure.account.oauth.provider.type&lt;/name&gt;
&lt;value&gt;&lt;/value&gt;
&lt;description&gt;
classname of Custom Authentication Provider
&lt;/description&gt;
&lt;/property&gt;
</pre></div></div>
<p>The declared class must implement <tt>org.apache.hadoop.fs.azurebfs.extensions.CustomTokenProviderAdaptee</tt> and optionally <tt>org.apache.hadoop.fs.azurebfs.extensions.BoundDTExtension</tt>.</p>
<p>The declared class also holds responsibility to implement retry logic while fetching access tokens.</p></div>
<div class="section">
<h3><a name="Delegation_Token_Provider"></a><a name="delegationtokensupportconfigoptions"></a> Delegation Token Provider</h3>
<p>A delegation token provider supplies the ABFS connector with delegation tokens, and helps renew and cancel them, by implementing the CustomDelegationTokenManager interface.</p>
<div>
<div>
<pre class="source">&lt;property&gt;
&lt;name&gt;fs.azure.enable.delegation.token&lt;/name&gt;
&lt;value&gt;true&lt;/value&gt;
&lt;description&gt;Make this true to use delegation token provider&lt;/description&gt;
&lt;/property&gt;
&lt;property&gt;
&lt;name&gt;fs.azure.delegation.token.provider.type&lt;/name&gt;
&lt;value&gt;{fully-qualified-class-name-for-implementation-of-CustomDelegationTokenManager-interface}&lt;/value&gt;
&lt;/property&gt;
</pre></div></div>
<p>If delegation tokens are enabled and the config <tt>fs.azure.delegation.token.provider.type</tt> is not provided, an IllegalArgumentException is thrown.</p></div>
<div class="section">
<h3><a name="Shared_Access_Signature_.28SAS.29_Token_Provider"></a>Shared Access Signature (SAS) Token Provider</h3>
<p>A Shared Access Signature (SAS) token provider supplies the ABFS connector with SAS tokens by implementing the SASTokenProvider interface.</p>
<div>
<div>
<pre class="source">&lt;property&gt;
&lt;name&gt;fs.azure.account.auth.type&lt;/name&gt;
&lt;value&gt;SAS&lt;/value&gt;
&lt;/property&gt;
&lt;property&gt;
&lt;name&gt;fs.azure.sas.token.provider.type&lt;/name&gt;
&lt;value&gt;{fully-qualified-class-name-for-implementation-of-SASTokenProvider-interface}&lt;/value&gt;
&lt;/property&gt;
</pre></div></div>
<p>The declared class must implement <tt>org.apache.hadoop.fs.azurebfs.extensions.SASTokenProvider</tt>.</p></div></div>
<div class="section">
<h2><a name="Technical_notes"></a><a name="technical"></a> Technical notes</h2>
<div class="section">
<h3><a name="Proxy_setup"></a><a name="proxy"></a> Proxy setup</h3>
<p>The connector uses the JVM proxy settings to control its proxy setup.</p>
<p>See the <a class="externalLink" href="https://docs.oracle.com/javase/8/docs/technotes/guides/net/proxies.html">Oracle Java documentation</a> for the options to set.</p>
<p>As the connector uses HTTPS by default, the <tt>https.proxyHost</tt> and <tt>https.proxyPort</tt> options are those which must be configured.</p>
<p>In MapReduce jobs, including distcp, the proxy options must be set in both the <tt>mapreduce.map.java.opts</tt> and <tt>mapreduce.reduce.java.opts</tt>.</p>
<div>
<div>
<pre class="source"># this variable is only here to avoid typing the same values twice.
# Its name is not important.
export DISTCP_PROXY_OPTS=&quot;-Dhttps.proxyHost=web-proxy.example.com -Dhttps.proxyPort=80&quot;
hadoop distcp \
-D mapreduce.map.java.opts=&quot;$DISTCP_PROXY_OPTS&quot; \
-D mapreduce.reduce.java.opts=&quot;$DISTCP_PROXY_OPTS&quot; \
-update -skipcrccheck -numListstatusThreads 40 \
hdfs://namenode:8020/users/alice abfs://backups@account.dfs.core.windows.net/users/alice
</pre></div></div>
<p>Without these settings, even though access to ADLS may work from the command line, <tt>distcp</tt> access can fail with network errors.</p></div>
<div class="section">
<h3><a name="Security"></a><a name="security"></a> Security</h3>
<p>As with other object stores, login secrets are valuable pieces of information. Organizations should have a process for safely sharing them.</p></div>
<div class="section">
<h3><a name="Limitations_of_the_ABFS_connector"></a><a name="limitations"></a> Limitations of the ABFS connector</h3>
<ul>
<li>File last access time is not tracked.</li>
<li>Extended attributes are not supported.</li>
<li>File Checksums are not supported.</li>
<li>The <tt>Syncable</tt> interface&#x2019;s <tt>hsync()</tt> and <tt>hflush()</tt> operations are supported if <tt>fs.azure.enable.flush</tt> is set to true (default=true). With the Wasb connector, this limited the number of times either call could be made to 50,000 <a class="externalLink" href="https://issues.apache.org/jira/browse/HADOOP-15478">HADOOP-15478</a>. If abfs has a similar limit, then excessive use of sync/flush may cause problems.</li>
</ul></div>
<div class="section">
<h3><a name="Consistency_and_Concurrency"></a><a name="consistency"></a> Consistency and Concurrency</h3>
<p>As with all Azure storage services, the Azure Datalake Gen 2 store offers a fully consistent view of the store, with complete Create, Read, Update, and Delete consistency for data and metadata. (Compare and contrast with S3 which only offers Create consistency; S3Guard adds CRUD to metadata, but not the underlying data).</p></div>
<div class="section">
<h3><a name="Performance_and_Scalability"></a><a name="performance"></a> Performance and Scalability</h3>
<p>For containers with hierarchical namespaces, the scalability numbers are, in Big-O-notation, as follows:</p>
<table border="0" class="bodyTable">
<thead>
<tr class="a">
<th> Operation </th>
<th> Scalability </th></tr>
</thead><tbody>
<tr class="b">
<td> File Rename </td>
<td> <tt>O(1)</tt> </td></tr>
<tr class="a">
<td> File Delete </td>
<td> <tt>O(1)</tt> </td></tr>
<tr class="b">
<td> Directory Rename </td>
<td> <tt>O(1)</tt> </td></tr>
<tr class="a">
<td> Directory Delete </td>
<td> <tt>O(1)</tt> </td></tr>
</tbody>
</table>
<p>For non-namespace stores, the scalability becomes:</p>
<table border="0" class="bodyTable">
<thead>
<tr class="a">
<th> Operation </th>
<th> Scalability </th></tr>
</thead><tbody>
<tr class="b">
<td> File Rename </td>
<td> <tt>O(1)</tt> </td></tr>
<tr class="a">
<td> File Delete </td>
<td> <tt>O(1)</tt> </td></tr>
<tr class="b">
<td> Directory Rename </td>
<td> <tt>O(files)</tt> </td></tr>
<tr class="a">
<td> Directory Delete </td>
<td> <tt>O(files)</tt> </td></tr>
</tbody>
</table>
<p>That is: the more files there are, the slower directory operations get.</p>
<p>Further reading: <a class="externalLink" href="https://docs.microsoft.com/en-us/azure/storage/common/storage-scalability-targets?toc=%2fazure%2fstorage%2fqueues%2ftoc.json">Azure Storage Scalability Targets</a></p></div>
<div class="section">
<h3><a name="Extensibility"></a><a name="extensibility"></a> Extensibility</h3>
<p>The ABFS connector supports a number of limited-private/unstable extension points for third-parties to integrate their authentication and authorization services into the ABFS client.</p>
<ul>
<li><tt>CustomDelegationTokenManager</tt> : adds ability to issue Hadoop Delegation Tokens.</li>
<li><tt>SASTokenProvider</tt>: allows for custom provision of Azure Storage Shared Access Signature (SAS) tokens.</li>
<li><tt>CustomTokenProviderAdaptee</tt>: allows for custom provision of Azure OAuth tokens.</li>
<li><tt>KeyProvider</tt>.</li>
</ul>
<p>Consult the source in <tt>org.apache.hadoop.fs.azurebfs.extensions</tt> and all associated tests to see how to make use of these extension points.</p>
<p><i>Warning</i> These extension points are unstable.</p></div></div>
<div class="section">
<h2><a name="Other_configuration_options"></a><a href="options"></a> Other configuration options</h2>
<p>Consult the javadocs for <tt>org.apache.hadoop.fs.azurebfs.constants.ConfigurationKeys</tt>, <tt>org.apache.hadoop.fs.azurebfs.constants.FileSystemConfigurations</tt> and <tt>org.apache.hadoop.fs.azurebfs.AbfsConfiguration</tt> for the full list of configuration options and their default values.</p>
<div class="section">
<h3><a name="Flush_Options"></a><a name="flushconfigoptions"></a> Flush Options</h3>
<div class="section">
<h4><a name="a1._Azure_Blob_File_System_Flush_Options"></a><a name="abfsflushconfigoptions"></a> 1. Azure Blob File System Flush Options</h4>
<p>Config <tt>fs.azure.enable.flush</tt> provides an option to render the ABFS flush APIs, hflush() and hsync(), as no-ops. By default, this config is set to true.</p>
<p>When enabled, both APIs ensure that data is persisted.</p></div>
<div class="section">
<h4><a name="a2._OutputStream_Flush_Options"></a><a name="outputstreamflushconfigoptions"></a> 2. OutputStream Flush Options</h4>
<p>Config <tt>fs.azure.disable.outputstream.flush</tt> provides an option to render the OutputStream flush() API a no-op in AbfsOutputStream. By default, this config is set to true.</p>
<p>hflush() is the only documented API that can provide persistent data transfer; having flush() also attempt to persist buffered data would lead to performance issues.</p></div></div>
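<p>The two flush settings can be stated explicitly as below. This is a sketch that restates the documented defaults: hflush() and hsync() persist data, while a plain flush() on the output stream is a no-op.</p>
<div>
<div>
<pre class="source">&lt;property&gt;
&lt;name&gt;fs.azure.enable.flush&lt;/name&gt;
&lt;!-- true: hflush() and hsync() persist data --&gt;
&lt;value&gt;true&lt;/value&gt;
&lt;/property&gt;
&lt;property&gt;
&lt;name&gt;fs.azure.disable.outputstream.flush&lt;/name&gt;
&lt;!-- true: OutputStream flush() is a no-op --&gt;
&lt;value&gt;true&lt;/value&gt;
&lt;/property&gt;
</pre></div></div>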
<div class="section">
<h3><a name="HNS_Check_Options"></a><a name="hnscheckconfigoptions"></a> HNS Check Options</h3>
<p>Config <tt>fs.azure.account.hns.enabled</tt> provides an option to specify whether the storage account is HNS enabled or not. If the config is not provided, a server call is made to check.</p></div>
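<p>For an account known to have hierarchical namespaces enabled, such as the <tt>abfswales1</tt> account created earlier with <tt>--hierarchical-namespace true</tt>, the setting can be declared up front. A sketch, assuming it is placed in <tt>core-site.xml</tt>:</p>
<div>
<div>
<pre class="source">&lt;property&gt;
&lt;name&gt;fs.azure.account.hns.enabled&lt;/name&gt;
&lt;value&gt;true&lt;/value&gt;
&lt;/property&gt;
</pre></div></div>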
<div class="section">
<h3><a name="Access_Options"></a><a name="flushconfigoptions"></a> Access Options</h3>
<p>Config <tt>fs.azure.enable.check.access</tt> must be set to true to enable AzureBlobFileSystem.access().</p></div>
<div class="section">
<h3><a name="Operation_Idempotency"></a><a name="idempotency"></a> Operation Idempotency</h3>
<p>Requests failing due to server timeouts and network failures will be retried. PUT/POST operations are idempotent and need no specific handling except for Rename and Delete operations.</p>
<p>Rename idempotency checks are made by ensuring the LastModifiedTime on destination is recent if source path is found to be non-existent on retry.</p>
<p>Delete is considered to be idempotent by default if the target does not exist on retry.</p></div>
<div class="section">
<h3><a name="Primary_User_Group_Options"></a><a name="featureconfigoptions"></a> Primary User Group Options</h3>
<p>The group name which is part of FileStatus and AclStatus will be set the same as the username if the following config is set to true <tt>fs.azure.skipUserGroupMetadataDuringInitialization</tt>.</p></div>
<div class="section">
<h3><a name="IO_Options"></a><a name="ioconfigoptions"></a> IO Options</h3>
<p>The following configs are related to read and write operations.</p>
<p><tt>fs.azure.io.retry.max.retries</tt>: Sets the number of retries for IO operations. Currently this is used only for the server call retry logic. Used within AbfsClient class as part of the ExponentialRetryPolicy. The value should be greater than or equal to 0.</p>
<p><tt>fs.azure.write.request.size</tt>: Sets the write buffer size, in bytes. The value should be between 16384 and 104857600, both inclusive (16 KB to 100 MB). The default value is 8388608 (8 MB).</p>
<p><tt>fs.azure.read.request.size</tt>: Sets the read buffer size, in bytes. The value should be between 16384 and 104857600, both inclusive (16 KB to 100 MB). The default value is 4194304 (4 MB).</p>
<p><tt>fs.azure.read.alwaysReadBufferSize</tt>: Read request size configured by <tt>fs.azure.read.request.size</tt> will be honoured only when the reads done are in sequential pattern. When the read pattern is detected to be random, read size will be same as the buffer length provided by the calling process. This config when set to true will force random reads to also read in same request sizes as sequential reads. This is a means to have same read patterns as of ADLS Gen1, as it does not differentiate read patterns and always reads by the configured read request size. The default value for this config will be false, where reads for the provided buffer length is done when random read pattern is detected.</p>
<p><tt>fs.azure.readaheadqueue.depth</tt>: Sets the readahead queue depth in AbfsInputStream. In case the set value is negative the read ahead queue depth will be set as Runtime.getRuntime().availableProcessors(). By default the value will be -1. To disable readaheads, set this value to 0. If your workload is doing only random reads (non-sequential) or you are seeing throttling, you may try setting this value to 0.</p>
<p><tt>fs.azure.read.readahead.blocksize</tt>: Sets the read buffer size for read-aheads, in bytes. The value should be between 16384 and 104857600, both inclusive (16 KB to 100 MB). The default value is 4194304 (4 MB).</p>
<p><tt>fs.azure.buffered.pread.disable</tt>: By default the positional read API does a seek and read on the input stream. This read fills the buffer cache in AbfsInputStream and updates the cursor positions. If this optimization is enabled, it skips the buffer and does a lock-free REST call to read from the blob. This is particularly helpful for HBase-style short random reads over a shared AbfsInputStream instance. Note: this config cannot be set at cluster level. It can be used as an option on FutureDataInputStreamBuilder; see FileSystem#openFile(Path path).</p>
<p>To run in limited-memory situations, especially when there are many writes from the same process, configure the following.</p>
<p><tt>fs.azure.write.max.concurrent.requests</tt>: Sets the maximum number of concurrent write requests from an AbfsOutputStream instance to the server at any point in time. Effectively this is the thread pool size within the AbfsOutputStream instance. Set the value between 1 and 8, both inclusive.</p>
<p><tt>fs.azure.write.max.requests.to.queue</tt>: Sets the maximum number of write requests that can be queued. Memory consumption of an AbfsOutputStream instance can be tuned with this config, considering that each queued request holds a buffer. Set the value to 3 or 4 times the value set for <tt>fs.azure.write.max.concurrent.requests</tt>.</p></div>
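<p>As an illustration of the limited-memory guidance, the fragment below caps each output stream at four concurrent writes and queues three times that many requests. The specific numbers are example choices within the documented ranges, not recommendations.</p>
<div>
<div>
<pre class="source">&lt;property&gt;
&lt;name&gt;fs.azure.write.max.concurrent.requests&lt;/name&gt;
&lt;!-- allowed range: 1 to 8 --&gt;
&lt;value&gt;4&lt;/value&gt;
&lt;/property&gt;
&lt;property&gt;
&lt;name&gt;fs.azure.write.max.requests.to.queue&lt;/name&gt;
&lt;!-- 3 x the concurrent request count above --&gt;
&lt;value&gt;12&lt;/value&gt;
&lt;/property&gt;
</pre></div></div>
<p>Since each queued request holds a buffer of up to <tt>fs.azure.write.request.size</tt> bytes, lowering either value reduces the memory a single stream can hold.</p>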
<div class="section">
<h3><a name="Security_Options"></a><a name="securityconfigoptions"></a> Security Options</h3>
<p><tt>fs.azure.always.use.https</tt>: Enforces the use of HTTPS instead of HTTP when the flag is set to true. Irrespective of the flag, AbfsClient will use HTTPS if the secure scheme (ABFSS) is used or OAuth is used for authentication. By default this is set to true.</p>
<p><tt>fs.azure.ssl.channel.mode</tt>: Initializes DelegatingSSLSocketFactory with the specified SSL channel mode. The value should be one of the enum DelegatingSSLSocketFactory.SSLChannelMode. The default value is DelegatingSSLSocketFactory.SSLChannelMode.Default.</p></div>
<div class="section">
<h3><a name="Server_Options"></a><a name="serverconfigoptions"></a> Server Options</h3>
<p>When the config <tt>fs.azure.io.read.tolerate.concurrent.append</tt> is set to true, the If-Match header sent to the server for read calls is set to *; otherwise it is set to the ETag. This is a mechanism for handling reads with optimistic concurrency. Refer to the following links for further information.</p>
<ol style="list-style-type: decimal">
<li><a class="externalLink" href="https://docs.microsoft.com/en-us/rest/api/storageservices/datalakestoragegen2/path/read">https://docs.microsoft.com/en-us/rest/api/storageservices/datalakestoragegen2/path/read</a></li>
<li><a class="externalLink" href="https://azure.microsoft.com/de-de/blog/managing-concurrency-in-microsoft-azure-storage-2/">https://azure.microsoft.com/de-de/blog/managing-concurrency-in-microsoft-azure-storage-2/</a></li>
</ol>
<p>The listStatus API fetches the FileStatus information from the server page by page. The config <tt>fs.azure.list.max.results</tt> is used to set the maxResults URI parameter, which sets the page size (maximum results per call). The value should be &gt; 0. The default is 5000. The server also caps this parameter at 5000, so even if the config is set above 5000 the response will contain at most 5000 entries. Refer to the following link for further information: <a class="externalLink" href="https://docs.microsoft.com/en-us/rest/api/storageservices/datalakestoragegen2/path/list">https://docs.microsoft.com/en-us/rest/api/storageservices/datalakestoragegen2/path/list</a></p></div>
<div class="section">
<h3><a name="Throttling_Options"></a><a name="throttlingconfigoptions"></a> Throttling Options</h3>
<p>The ABFS driver can throttle read and write operations to achieve maximum throughput by minimizing errors. The errors occur when the account ingress or egress limits are exceeded and the server throttles requests. Server-side throttling causes the retry policy to be used, but the retry policy sleeps for long periods of time, causing the total ingress or egress throughput to be as much as 35% lower than optimal. The retry policy is also after the fact, in that it applies after a request fails. The client-side throttling implemented here, on the other hand, happens before requests are made and sleeps just enough to minimize errors, allowing optimal ingress and/or egress throughput. By default the throttling mechanism is enabled in the driver. It can be disabled by setting the config <tt>fs.azure.enable.autothrottling</tt> to false.</p></div>
<div class="section">
<h3><a name="Rename_Options"></a><a name="renameconfigoptions"></a> Rename Options</h3>
<p><tt>fs.azure.atomic.rename.key</tt>: A comma-separated list of directories for atomic rename support. The driver logs the following warning if the source of a rename belongs to one of the configured directories: &#x201c;The atomic rename feature is not supported by the ABFS scheme ; however, rename, create and delete operations are atomic if Namespace is enabled for your Azure Storage account.&#x201d; The default value is &#x201c;/hbase&#x201d;.</p></div>
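<p>As an illustrative sketch, several directories could be listed like this. <tt>/hbase</tt> is the documented default; <tt>/data/atomic</tt> is a hypothetical directory used only for the example.</p>
<div>
<div>
<pre class="source">&lt;property&gt;
  &lt;name&gt;fs.azure.atomic.rename.key&lt;/name&gt;
  &lt;!-- Comma-separated list; /data/atomic is a made-up example path. --&gt;
  &lt;value&gt;/hbase,/data/atomic&lt;/value&gt;
&lt;/property&gt;
</pre></div></div>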
<div class="section">
<h3><a name="Infinite_Lease_Options"></a><a name="infiniteleaseoptions"></a> Infinite Lease Options</h3>
<p><tt>fs.azure.infinite-lease.directories</tt>: A comma-separated list of directories for infinite lease support. By default, multiple clients can write to the same file simultaneously. When writing to files within the directories specified in this config, the client obtains a lease on the file that prevents any other client from writing to it. The lease is released when the output stream is closed. To revoke a client&#x2019;s write access to a file, call the <tt>AzureBlobFileSystem</tt> <tt>breakLease</tt> method. If the client dies before the file is closed and the lease released, <tt>breakLease</tt> must be called before another client can write to the file.</p>
<p><tt>fs.azure.lease.threads</tt>: The size of the thread pool used for lease operations on infinite lease directories. The default value is 0, so it must be set to at least 1 to support infinite lease directories.</p></div>
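<p>The two lease settings are typically needed together. A minimal sketch, assuming a hypothetical directory <tt>/leased/output</tt> and the smallest thread pool size that enables the feature:</p>
<div>
<div>
<pre class="source">&lt;property&gt;
  &lt;name&gt;fs.azure.infinite-lease.directories&lt;/name&gt;
  &lt;!-- Comma-separated list; /leased/output is a made-up example path. --&gt;
  &lt;value&gt;/leased/output&lt;/value&gt;
&lt;/property&gt;
&lt;property&gt;
  &lt;name&gt;fs.azure.lease.threads&lt;/name&gt;
  &lt;!-- The default of 0 leaves no threads for lease operations, so use at least 1. --&gt;
  &lt;value&gt;1&lt;/value&gt;
&lt;/property&gt;
</pre></div></div>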
<div class="section">
<h3><a name="Perf_Options"></a><a name="perfoptions"></a> Perf Options</h3>
<div class="section">
<h4><a name="a1._HTTP_Request_Tracking_Options"></a><a name="abfstracklatencyoptions"></a> 1. HTTP Request Tracking Options</h4>
<p>If you set <tt>fs.azure.abfs.latency.track</tt> to <tt>true</tt>, the module starts tracking the performance metrics of ABFS HTTP traffic. To obtain these numbers on your machine or cluster, you also need to enable debug logging for the <tt>AbfsPerfTracker</tt> class in your <tt>log4j</tt> config. A typical perf log entry looks like this (wrapped here for readability):</p>
<div>
<div>
<pre class="source">h=KARMA t=2019-10-25T20:21:14.518Z a=abfstest01.dfs.core.windows.net
c=abfs-testcontainer-84828169-6488-4a62-a875-1e674275a29f cr=delete ce=deletePath
r=Succeeded l=32 ls=32 lc=1 s=200 e= ci=95121dae-70a8-4187-b067-614091034558
ri=97effdcf-201f-0097-2d71-8bae00000000 ct=0 st=0 rt=0 bs=0 br=0 m=DELETE
u=https%3A%2F%2Fabfstest01.dfs.core.windows.net%2Ftestcontainer%2Ftest%3Ftimeout%3D90%26recursive%3Dtrue
</pre></div></div>
<p>The fields have the following definitions:</p>
<ul>
<li><tt>h</tt>: host name</li>
<li><tt>t</tt>: time when this request was logged</li>
<li><tt>a</tt>: Azure storage account name</li>
<li><tt>c</tt>: container name</li>
<li><tt>cr</tt>: name of the caller method</li>
<li><tt>ce</tt>: name of the callee method</li>
<li><tt>r</tt>: result (Succeeded/Failed)</li>
<li><tt>l</tt>: latency (time spent in callee)</li>
<li><tt>ls</tt>: latency sum (aggregate time spent in caller; logged when there are multiple callees; logged with the last callee)</li>
<li><tt>lc</tt>: latency count (number of callees; logged when there are multiple callees; logged with the last callee)</li>
<li><tt>s</tt>: HTTP status code</li>
<li><tt>e</tt>: error code</li>
<li><tt>ci</tt>: client request ID</li>
<li><tt>ri</tt>: server request ID</li>
<li><tt>ct</tt>: connection time in milliseconds</li>
<li><tt>st</tt>: sending time in milliseconds</li>
<li><tt>rt</tt>: receiving time in milliseconds</li>
<li><tt>bs</tt>: bytes sent</li>
<li><tt>br</tt>: bytes received</li>
<li><tt>m</tt>: HTTP method (GET, PUT etc.)</li>
<li><tt>u</tt>: encoded HTTP URL</li>
</ul>
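<p>To produce log entries with the fields above, latency tracking has to be switched on. A minimal sketch of the filesystem side of that setup, assuming the property is placed in <tt>core-site.xml</tt>; the debug logger for <tt>AbfsPerfTracker</tt> still has to be enabled separately in your <tt>log4j</tt> config.</p>
<div>
<div>
<pre class="source">&lt;property&gt;
  &lt;name&gt;fs.azure.abfs.latency.track&lt;/name&gt;
  &lt;!-- Enables collection of per-request performance metrics. --&gt;
  &lt;value&gt;true&lt;/value&gt;
&lt;/property&gt;
</pre></div></div>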
<p>Note that these performance numbers are also sent back to the ADLS Gen 2 API endpoints in the <tt>x-ms-abfs-client-latency</tt> HTTP headers of subsequent requests. Azure uses these values to track end-to-end latency.</p></div></div></div>
<div class="section">
<h2><a name="Troubleshooting"></a><a name="troubleshooting"></a> Troubleshooting</h2>
<p>The problems associated with the connector usually come down to, in order:</p>
<ol style="list-style-type: decimal">
<li>Classpath.</li>
<li>Network setup (proxy etc.).</li>
<li>Authentication and Authorization.</li>
<li>Anything else.</li>
</ol>
<p>If you log <tt>org.apache.hadoop.fs.azurebfs.services</tt> at <tt>DEBUG</tt> then you will see more details about any request which is failing.</p>
<p>One useful tool for debugging connectivity is the <a class="externalLink" href="https://github.com/steveloughran/cloudstore/releases">cloudstore storediag utility</a>.</p>
<p>This validates the classpath, the settings, then tries to work with the filesystem.</p>
<div>
<div>
<pre class="source">bin/hadoop jar cloudstore-0.1-SNAPSHOT.jar storediag abfs://container@account.dfs.core.windows.net/
</pre></div></div>
<ol style="list-style-type: decimal">
<li>If the <tt>storediag</tt> command cannot work with an abfs store, nothing else is likely to.</li>
<li>If the <tt>storediag</tt> command does work successfully, that does not guarantee that the classpath or configuration on the rest of the cluster is also going to work, especially in distributed applications. But it is at least a start.</li>
</ol>
<div class="section">
<h3><a name="ClassNotFoundException:_org.apache.hadoop.fs.azurebfs.AzureBlobFileSystem"></a><tt>ClassNotFoundException: org.apache.hadoop.fs.azurebfs.AzureBlobFileSystem</tt></h3>
<p>The <tt>hadoop-azure</tt> JAR is not on the classpath.</p>
<div>
<div>
<pre class="source">java.lang.RuntimeException: java.lang.ClassNotFoundException:
Class org.apache.hadoop.fs.azurebfs.AzureBlobFileSystem not found
at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:2625)
at org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:3290)
at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:3322)
at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:136)
at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:3373)
at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:3341)
at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:491)
at org.apache.hadoop.fs.Path.getFileSystem(Path.java:361)
Caused by: java.lang.ClassNotFoundException:
Class org.apache.hadoop.fs.azurebfs.AzureBlobFileSystem not found
at org.apache.hadoop.conf.Configuration.getClassByName(Configuration.java:2529)
at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:2623)
... 16 more
</pre></div></div>
<p>Tip: if this is happening on the command line, you can turn on debug logging of the hadoop scripts:</p>
<div>
<div>
<pre class="source">export HADOOP_SHELL_SCRIPT_DEBUG=true
</pre></div></div>
<p>If this is happening on an application running within the cluster, it means the cluster (somehow) needs to be configured so that the <tt>hadoop-azure</tt> module and dependencies are on the classpath of deployed applications.</p></div>
<div class="section">
<h3><a name="ClassNotFoundException:_com.microsoft.azure.storage.StorageErrorCode"></a><tt>ClassNotFoundException: com.microsoft.azure.storage.StorageErrorCode</tt></h3>
<p>The <tt>azure-storage</tt> JAR is not on the classpath.</p></div>
<div class="section">
<h3><a name="Server_failed_to_authenticate_the_request"></a><tt>Server failed to authenticate the request</tt></h3>
<p>The request wasn&#x2019;t authenticated while using the default shared-key authentication mechanism.</p>
<div>
<div>
<pre class="source">Operation failed: &quot;Server failed to authenticate the request.
Make sure the value of Authorization header is formed correctly including the signature.&quot;,
403, HEAD, https://account.dfs.core.windows.net/container2?resource=filesystem&amp;timeout=90
at org.apache.hadoop.fs.azurebfs.services.AbfsRestOperation.execute(AbfsRestOperation.java:135)
at org.apache.hadoop.fs.azurebfs.services.AbfsClient.getFilesystemProperties(AbfsClient.java:209)
at org.apache.hadoop.fs.azurebfs.AzureBlobFileSystemStore.getFilesystemProperties(AzureBlobFileSystemStore.java:259)
at org.apache.hadoop.fs.azurebfs.AzureBlobFileSystem.fileSystemExists(AzureBlobFileSystem.java:859)
at org.apache.hadoop.fs.azurebfs.AzureBlobFileSystem.initialize(AzureBlobFileSystem.java:110)
</pre></div></div>
<p>Causes include:</p>
<ul>
<li>Your credentials are incorrect.</li>
<li>Your shared secret has expired. In Azure, this happens automatically.</li>
<li>Your shared secret has been revoked.</li>
<li>Host/VM clock drift means that your client&#x2019;s clock is out of sync with the Azure servers; the call is being rejected as it is either out of date (considered a replay) or from the future. Fix: check your clocks.</li>
</ul></div>
<div class="section">
<h3><a name="Configuration_property__something_.dfs.core.windows.net_not_found"></a><tt>Configuration property _something_.dfs.core.windows.net not found</tt></h3>
<p>There&#x2019;s no <tt>fs.azure.account.key.</tt> entry in your cluster configuration declaring the access key for the specific account, or you are using the wrong URL.</p>
<div>
<div>
<pre class="source">$ hadoop fs -ls abfs://container@abfswales2.dfs.core.windows.net/
ls: Configuration property abfswales2.dfs.core.windows.net not found.
</pre></div></div>
<ul>
<li>Make sure that the URL is correct.</li>
<li>Add the missing account key.</li>
</ul></div>
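<p>As a hedged sketch, the missing entry for the account in the example above might look like this. The account name is taken from the sample error message; the value is a placeholder, not a real key, and in practice the key may be better kept in a credential store than in plain text.</p>
<div>
<div>
<pre class="source">&lt;property&gt;
  &lt;name&gt;fs.azure.account.key.abfswales2.dfs.core.windows.net&lt;/name&gt;
  &lt;!-- Placeholder: replace with the access key of the storage account. --&gt;
  &lt;value&gt;YOUR_ACCESS_KEY&lt;/value&gt;
&lt;/property&gt;
</pre></div></div>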
<div class="section">
<h3><a name="No_such_file_or_directory_when_trying_to_list_a_container"></a><tt>No such file or directory when trying to list a container</tt></h3>
<p>There is no container of the given name. Either it has been mistyped or the container needs to be created.</p>
<div>
<div>
<pre class="source">$ hadoop fs -ls abfs://container@abfswales1.dfs.core.windows.net/
ls: `abfs://container@abfswales1.dfs.core.windows.net/': No such file or directory
</pre></div></div>
<ul>
<li>Make sure that the URL is correct.</li>
<li>Create the container if needed.</li>
</ul></div>
<div class="section">
<h3><a name="a.E2.80.9CHTTP_connection_to_https:.2F.2Flogin.microsoftonline.com.2Fsomething_failed_for_getting_token_from_AzureAD._Http_response:_200_OK.E2.80.9D"></a>&#x201c;HTTP connection to <a class="externalLink" href="https://login.microsoftonline.com/">https://login.microsoftonline.com/</a><i>something</i> failed for getting token from AzureAD. Http response: 200 OK&#x201d;</h3>
<ul>
<li>The response has a content type of <tt>text/html</tt>, <tt>text/plain</tt>, or <tt>application/xml</tt>.</li>
</ul>
<p>The OAuth authentication page didn&#x2019;t fail with an HTTP error code, but it didn&#x2019;t return JSON either.</p>
<div>
<div>
<pre class="source">$ bin/hadoop fs -ls abfs://container@abfswales1.dfs.core.windows.net/
...
ls: HTTP Error 200;
url='https://login.microsoftonline.com/02a07549-0a5f-4c91-9d76-53d172a638a2/oauth2/authorize'
AADToken: HTTP connection to
https://login.microsoftonline.com/02a07549-0a5f-4c91-9d76-53d172a638a2/oauth2/authorize
failed for getting token from AzureAD.
Unexpected response.
Check configuration, URLs and proxy settings.
proxies=none;
requestId='dd9d526c-8b3d-4b3f-a193-0cf021938600';
contentType='text/html; charset=utf-8';
</pre></div></div>
<p>Likely causes are configuration and networking:</p>
<ol style="list-style-type: decimal">
<li>Authentication is failing: the caller is being served the Azure Active Directory sign-on page for humans, even though it is a machine calling.</li>
<li>The URL is wrong: it is pointing at a web page unrelated to OAuth2.0.</li>
<li>There&#x2019;s a proxy server in the way trying to return helpful instructions.</li>
</ol></div>
<div class="section">
<h3><a name="java.io.IOException:_The_ownership_on_the_staging_directory_.2Ftmp.2Fhadoop-yarn.2Fstaging.2Fuser1.2F.staging_is_not_as_expected._It_is_owned_by_.3Cprincipal_id.3E._The_directory_must_be_owned_by_the_submitter_user1_or_user1"></a><tt>java.io.IOException: The ownership on the staging directory /tmp/hadoop-yarn/staging/user1/.staging is not as expected. It is owned by &lt;principal_id&gt;. The directory must be owned by the submitter user1 or user1</tt></h3>
<p>When using <a class="externalLink" href="https://docs.microsoft.com/en-us/azure/active-directory/managed-identities-azure-resources/overview">Azure Managed Identities</a>, the files and directories in ADLS Gen2 are by default owned by the service principal object ID (the principal ID), so submitting jobs as the local OS user &#x2018;user1&#x2019; results in the above exception.</p>
<p>The fix is to map the ownership to the local OS user by adding the properties below to <tt>core-site.xml</tt>.</p>
<div>
<div>
<pre class="source">&lt;property&gt;
  &lt;name&gt;fs.azure.identity.transformer.service.principal.id&lt;/name&gt;
  &lt;value&gt;service principal object id&lt;/value&gt;
  &lt;description&gt;
    An Azure Active Directory object ID (oid) used as the replacement for names contained
    in the list specified by &#x201c;fs.azure.identity.transformer.service.principal.substitution.list&#x201d;.
    Notice that instead of setting oid, you can also set $superuser here.
  &lt;/description&gt;
&lt;/property&gt;
&lt;property&gt;
  &lt;name&gt;fs.azure.identity.transformer.service.principal.substitution.list&lt;/name&gt;
  &lt;value&gt;user1&lt;/value&gt;
  &lt;description&gt;
    A comma separated list of names to be replaced with the service principal ID specified by
    &#x201c;fs.azure.identity.transformer.service.principal.id&#x201d;. This substitution occurs
    when setOwner, setAcl, modifyAclEntries, or removeAclEntries are invoked with identities
    contained in the substitution list. Notice that when in non-secure cluster, asterisk symbol *
    can be used to match all user/group.
  &lt;/description&gt;
&lt;/property&gt;
</pre></div></div>
<p>Once the above properties are configured, <tt>hdfs dfs -ls abfs://container1@abfswales1.dfs.core.windows.net/</tt> shows the ADLS Gen2 files/directories are now owned by &#x2018;user1&#x2019;.</p></div></div>
<div class="section">
<h2><a name="Testing_ABFS"></a><a name="testing"></a> Testing ABFS</h2>
<p>See the relevant section in <a href="testing_azure.html">Testing Azure</a>.</p></div>
</div>
</div>
<div class="clear">
<hr/>
</div>
<div id="footer">
<div class="xright">
&#169; 2008-2021
Apache Software Foundation
- <a href="http://maven.apache.org/privacy-policy.html">Privacy Policy</a>.
Apache Maven, Maven, Apache, the Apache feather logo, and the Apache Maven project logos are trademarks of The Apache Software Foundation.
</div>
<div class="clear">
<hr/>
</div>
</div>
</body>
</html>