:py:mod:`airflow.providers.microsoft.azure.hooks.data_lake`
===========================================================
.. py:module:: airflow.providers.microsoft.azure.hooks.data_lake
.. autoapi-nested-parse::
This module contains integration with Azure Data Lake.
AzureDataLakeHook communicates via a REST API compatible with WebHDFS. Make sure that an
Airflow connection of type `azure_data_lake` exists. Authorization can be done by supplying a
login (=Client ID), a password (=Client Secret), and the extra fields tenant (Tenant) and
account_name (Account Name) (see connection `azure_data_lake_default` for an example).
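Such a connection can also be created programmatically; a minimal sketch with placeholder credentials (every value below is illustrative, not a real default):
.. code-block:: python

   from airflow.models.connection import Connection

   # Sketch of the expected connection shape; all values are placeholders.
   conn = Connection(
       conn_id="azure_data_lake_default",
       conn_type="azure_data_lake",
       login="<CLIENT_ID>",  # Azure AD application (client) ID
       password="<CLIENT_SECRET>",  # client secret for that application
       extra='{"tenant": "<TENANT>", "account_name": "<ACCOUNT_NAME>"}',
   )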
Module Contents
---------------
Classes
~~~~~~~
.. autoapisummary::
airflow.providers.microsoft.azure.hooks.data_lake.AzureDataLakeHook
.. py:class:: AzureDataLakeHook(azure_data_lake_conn_id=default_conn_name)
Bases: :py:obj:`airflow.hooks.base.BaseHook`
Interacts with Azure Data Lake.
Client ID and client secret should be in the user and password parameters.
Tenant and account name should be in the extra fields as
{"tenant": "<TENANT>", "account_name": "<ACCOUNT_NAME>"}.
:param azure_data_lake_conn_id: Reference to the :ref:`Azure Data Lake connection<howto/connection:adl>`.
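A minimal usage sketch, assuming an `azure_data_lake_default` connection exists:
.. code-block:: python

   from airflow.providers.microsoft.azure.hooks.data_lake import AzureDataLakeHook

   hook = AzureDataLakeHook(azure_data_lake_conn_id="azure_data_lake_default")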
.. py:attribute:: conn_name_attr
:annotation: = azure_data_lake_conn_id
.. py:attribute:: default_conn_name
:annotation: = azure_data_lake_default
.. py:attribute:: conn_type
:annotation: = azure_data_lake
.. py:attribute:: hook_name
:annotation: = Azure Data Lake
.. py:method:: get_connection_form_widgets()
:staticmethod:
Return connection widgets to add to the connection form.
.. py:method:: get_ui_field_behaviour()
:staticmethod:
Return custom field behaviour.
.. py:method:: get_conn(self)
Return an `AzureDLFileSystem` object.
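For operations not wrapped by this hook, the returned client can be used directly. A sketch (`ls` belongs to the `azure-datalake-store` client, not to this hook):
.. code-block:: python

   from airflow.providers.microsoft.azure.hooks.data_lake import AzureDataLakeHook

   hook = AzureDataLakeHook(azure_data_lake_conn_id="azure_data_lake_default")
   adl = hook.get_conn()  # azure.datalake.store.core.AzureDLFileSystem
   print(adl.ls("/"))  # list the filesystem root via the underlying client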
.. py:method:: check_for_file(self, file_path)
Check if a file exists on Azure Data Lake.
:param file_path: Path and name of the file.
:return: True if the file exists, False otherwise.
:rtype: bool
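For example (the remote path below is purely illustrative):
.. code-block:: python

   from airflow.providers.microsoft.azure.hooks.data_lake import AzureDataLakeHook

   hook = AzureDataLakeHook(azure_data_lake_conn_id="azure_data_lake_default")
   if hook.check_for_file("landing/2023-01-01/events.json"):
       print("file already present")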
.. py:method:: upload_file(self, local_path, remote_path, nthreads=64, overwrite=True, buffersize=4194304, blocksize=4194304, **kwargs)
Upload a file to Azure Data Lake.
:param local_path: Local path. Can be a single file, a directory (in which case it is
uploaded recursively), or a glob pattern. Recursive glob patterns using `**`
are not supported.
:param remote_path: Remote path to upload to; if multiple files, this is the
directory root to write within.
:param nthreads: Number of threads to use. If None, uses the number of cores.
:param overwrite: Whether to forcibly overwrite existing files/directories.
If False and the remote path is a directory, the operation quits regardless of
whether any files would be overwritten. If True, only matching filenames are
actually overwritten.
:param buffersize: Number of bytes for the internal buffer (default 4194304, i.e. 2**22).
This buffer cannot be bigger than a chunk and cannot be smaller than a block.
:param blocksize: Number of bytes for a block (default 4194304, i.e. 2**22). Within each
chunk, we write a smaller block for each API call. This block cannot be bigger
than a chunk.
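A usage sketch (both paths are illustrative; `remote_path` is resolved within the configured account):
.. code-block:: python

   from airflow.providers.microsoft.azure.hooks.data_lake import AzureDataLakeHook

   hook = AzureDataLakeHook(azure_data_lake_conn_id="azure_data_lake_default")
   hook.upload_file(
       local_path="/tmp/report.csv",  # illustrative local file
       remote_path="landing/report.csv",  # illustrative ADLS path
       overwrite=True,
   )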
.. py:method:: download_file(self, local_path, remote_path, nthreads=64, overwrite=True, buffersize=4194304, blocksize=4194304, **kwargs)
Download a file from Azure Data Lake Storage.
:param local_path: Local path. If downloading a single file, will write to this
specific file, unless it is an existing directory, in which case a file is
created within it. If downloading multiple files, this is the root
directory to write within. Will create directories as required.
:param remote_path: remote path/globstring to use to find remote files.
Recursive glob patterns using `**` are not supported.
:param nthreads: Number of threads to use. If None, uses the number of cores.
:param overwrite: Whether to forcibly overwrite existing files/directories.
If False and the remote path is a directory, the operation quits regardless of
whether any files would be overwritten. If True, only matching filenames are
actually overwritten.
:param buffersize: Number of bytes for the internal buffer (default 4194304, i.e. 2**22).
This buffer cannot be bigger than a chunk and cannot be smaller than a block.
:param blocksize: Number of bytes for a block (default 4194304, i.e. 2**22). Within each
chunk, we write a smaller block for each API call. This block cannot be bigger
than a chunk.
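A usage sketch mirroring the upload example (paths are illustrative):
.. code-block:: python

   from airflow.providers.microsoft.azure.hooks.data_lake import AzureDataLakeHook

   hook = AzureDataLakeHook(azure_data_lake_conn_id="azure_data_lake_default")
   hook.download_file(
       local_path="/tmp/downloads/",  # existing directory: file is created inside it
       remote_path="landing/report.csv",  # illustrative ADLS path
       overwrite=True,
   )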
.. py:method:: list(self, path)
List files in Azure Data Lake Storage.
:param path: Full path/globstring to use to list files in ADLS.
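For example, listing with a glob pattern (the layout is illustrative):
.. code-block:: python

   from airflow.providers.microsoft.azure.hooks.data_lake import AzureDataLakeHook

   hook = AzureDataLakeHook(azure_data_lake_conn_id="azure_data_lake_default")
   files = hook.list("landing/2023-*/*.csv")  # globstring over the account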
.. py:method:: remove(self, path, recursive=False, ignore_not_found=True)
Remove files in Azure Data Lake Storage.
:param path: A directory or file to remove in ADLS.
:param recursive: Whether to recurse into subdirectories in the location and remove their files.
:param ignore_not_found: Whether to silently ignore a missing file rather than raise an error.
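For example, removing an illustrative staging directory and everything under it:
.. code-block:: python

   from airflow.providers.microsoft.azure.hooks.data_lake import AzureDataLakeHook

   hook = AzureDataLakeHook(azure_data_lake_conn_id="azure_data_lake_default")
   hook.remove("landing/tmp", recursive=True, ignore_not_found=True)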