The `hadoop-azure` module provides support for the Azure Data Lake Storage Gen2 storage layer through the "abfs" connector.

To make it part of Apache Hadoop's default classpath, make sure that `HADOOP_OPTIONAL_TOOLS` in `hadoop-env.sh` has `hadoop-azure` in the list.
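As a sketch, the relevant line in `hadoop-env.sh` might look like the following; `HADOOP_OPTIONAL_TOOLS` is a comma-separated list, so if other optional tools are already enabled, `hadoop-azure` is appended to the existing value rather than replacing it:

```shell
# hadoop-env.sh: add the hadoop-azure module to the default classpath.
# (Comma-separated; keep any other optional tools you already use.)
export HADOOP_OPTIONAL_TOOLS="hadoop-azure"
```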
The connector presents the store as a hierarchical file system by implementing the standard Hadoop `FileSystem` interface.

*TODO*: complete/review
The abfs client has a fully consistent view of the store, with complete Create, Read, Update, and Delete (CRUD) consistency for both data and metadata. (Compare and contrast with S3, which only offers Create consistency; S3Guard adds CRUD consistency to metadata, but not to the underlying data.)
*TODO*: check these.

* File Rename: `O(1)`.
* Directory Rename: `O(files)`.
* Directory Delete: `O(files)`.
Any configuration can be specified either generally (as the default when accessing all accounts) or tied to a specific account. For example, an OAuth identity can be configured for use regardless of which account is accessed with the property `fs.azure.account.oauth2.client.id`, or you can configure an identity to be used only for a specific storage account with `fs.azure.account.oauth2.client.id.<account_name>.dfs.core.windows.net`.
Note that it doesn't make sense to do this with some properties, like shared keys that are inherently account-specific.
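A sketch of how the two scopes might look in `core-site.xml`; the account name `myaccount` and the client id values are placeholders, not real identifiers:

```xml
<!-- Default OAuth client id, used for any account without a more specific setting. -->
<property>
  <name>fs.azure.account.oauth2.client.id</name>
  <value>00000000-0000-0000-0000-000000000000</value>
</property>

<!-- Account-qualified variant: overrides the default for the hypothetical
     storage account "myaccount" only. -->
<property>
  <name>fs.azure.account.oauth2.client.id.myaccount.dfs.core.windows.net</name>
  <value>11111111-1111-1111-1111-111111111111</value>
</property>
```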
Config `fs.azure.enable.flush` provides an option to render the ABFS flush APIs, `hflush()` and `hsync()`, as no-ops. By default this config is set to true, and both APIs ensure that data is persisted; setting it to false turns both into no-ops.
Config `fs.azure.disable.outputstream.flush` provides an option to render the `OutputStream.flush()` API a no-op in `AbfsOutputStream`. By default this config is set to true. Since `hflush()` is the only documented API that guarantees persistent data transfer, having `flush()` also attempt to persist buffered data can lead to performance issues.
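As a sketch, an application that calls the flush APIs heavily and does not need durability on every call might set both options in `core-site.xml` (values shown here are illustrative, not recommendations):

```xml
<!-- Turn hflush()/hsync() into no-ops (default is true, i.e. flushes persist data). -->
<property>
  <name>fs.azure.enable.flush</name>
  <value>false</value>
</property>

<!-- Keep OutputStream.flush() as a no-op in AbfsOutputStream (the default). -->
<property>
  <name>fs.azure.disable.outputstream.flush</name>
  <value>true</value>
</property>
```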
Config `fs.azure.enable.check.access` needs to be set true to enable `AzureBlobFileSystem.access()`. See the relevant section in Testing Azure.
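A minimal `core-site.xml` fragment to turn the access check on might look like:

```xml
<!-- Enable AzureBlobFileSystem.access() (disabled by default). -->
<property>
  <name>fs.azure.enable.check.access</name>
  <value>true</value>
</property>
```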