The `hadoop-azure` module provides support for the Azure Data Lake Storage Gen2 storage layer through the "abfs" connector.

To make it part of Apache Hadoop's default classpath, make sure that `HADOOP_OPTIONAL_TOOLS` in `hadoop-env.sh` has `hadoop-azure` in the list.
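As a sketch, the relevant line in `hadoop-env.sh` might look like the following; `HADOOP_OPTIONAL_TOOLS` is a comma-separated list, so if other optional tools are already enabled, `hadoop-azure` is appended to the existing value rather than replacing it:

```shell
# hadoop-env.sh: add the hadoop-azure module to the default classpath.
# (Comma-separated; keep any other optional tools you already use.)
export HADOOP_OPTIONAL_TOOLS="hadoop-azure"
```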
The connector presents the store as a hierarchical file system by implementing the standard Hadoop `FileSystem` interface.

*TODO*: complete/review
The abfs client has a fully consistent view of the store, with complete Create, Read, Update, and Delete (CRUD) consistency for both data and metadata. (Compare and contrast with S3, which only offers Create consistency; S3Guard adds CRUD consistency to metadata, but not to the underlying data.)
*TODO*: check these.

* File Rename: `O(1)`.
* Directory Rename: `O(files)`.
* Directory Delete: `O(files)`.
Any configuration can be specified either generally (as the default when accessing all accounts) or tied to a specific account. For example, an OAuth identity can be configured for use regardless of which account is accessed with the property `fs.azure.account.oauth2.client.id`, or you can configure an identity to be used only for a specific storage account with `fs.azure.account.oauth2.client.id.<account_name>.dfs.core.windows.net`.
Note that it doesn't make sense to do this with some properties, like shared keys that are inherently account-specific.
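A sketch of how the two scopes might look in `core-site.xml`; the account name `myaccount` and the client id values are placeholders, not real identifiers:

```xml
<!-- Default OAuth client id, used for any account without a more specific setting. -->
<property>
  <name>fs.azure.account.oauth2.client.id</name>
  <value>00000000-0000-0000-0000-000000000000</value>
</property>

<!-- Account-qualified variant: overrides the default for the hypothetical
     storage account "myaccount" only. -->
<property>
  <name>fs.azure.account.oauth2.client.id.myaccount.dfs.core.windows.net</name>
  <value>11111111-1111-1111-1111-111111111111</value>
</property>
```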
Config `fs.azure.enable.flush` provides an option to render the ABFS flush APIs, `hflush()` and `hsync()`, as no-ops. By default this config is set to true, and both APIs ensure that data is persisted; setting it to false turns both into no-ops.
Config `fs.azure.disable.outputstream.flush` provides an option to render the `OutputStream.flush()` API a no-op in `AbfsOutputStream`. By default this config is set to true. Since `hflush()` is the only documented API that guarantees persistent data transfer, having `flush()` also attempt to persist buffered data can lead to performance issues.
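As a sketch, an application that calls the flush APIs heavily and does not need durability on every call might set both options in `core-site.xml` (values shown here are illustrative, not recommendations):

```xml
<!-- Turn hflush()/hsync() into no-ops (default is true, i.e. flushes persist data). -->
<property>
  <name>fs.azure.enable.flush</name>
  <value>false</value>
</property>

<!-- Keep OutputStream.flush() as a no-op in AbfsOutputStream (the default). -->
<property>
  <name>fs.azure.disable.outputstream.flush</name>
  <value>true</value>
</property>
```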
Config `fs.azure.enable.check.access` needs to be set true to enable `AzureBlobFileSystem.access()`. See the relevant section in Testing Azure.
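A minimal `core-site.xml` fragment to turn the access check on might look like:

```xml
<!-- Enable AzureBlobFileSystem.access() (disabled by default). -->
<property>
  <name>fs.azure.enable.check.access</name>
  <value>true</value>
</property>
```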