| :py:mod:`airflow.providers.microsoft.azure.hooks.data_lake` |
| =========================================================== |
| |
| .. py:module:: airflow.providers.microsoft.azure.hooks.data_lake |
| |
| .. autoapi-nested-parse:: |
| |
| This module contains integration with Azure Data Lake. |
| |
| AzureDataLakeHook communicates via a REST API compatible with WebHDFS. Make sure that a |
| Airflow connection of type `azure_data_lake` exists. Authorization can be done by supplying a |
| login (=Client ID), password (=Client Secret) and extra fields tenant (Tenant) and account_name (Account Name) |
| (see connection `azure_data_lake_default` for an example). |
| |
| |
| |
| Module Contents |
| --------------- |
| |
| Classes |
| ~~~~~~~ |
| |
| .. autoapisummary:: |
| |
| airflow.providers.microsoft.azure.hooks.data_lake.AzureDataLakeHook |
| |
| |
| |
| |
| .. py:class:: AzureDataLakeHook(azure_data_lake_conn_id = default_conn_name) |
| |
| Bases: :py:obj:`airflow.hooks.base.BaseHook` |
| |
| Interacts with Azure Data Lake. |
| |
| Client ID and client secret should be in user and password parameters. |
| Tenant and account name should be extra field as |
| {"tenant": "<TENANT>", "account_name": "ACCOUNT_NAME"}. |
| |
| :param azure_data_lake_conn_id: Reference to the :ref:`Azure Data Lake connection<howto/connection:adl>`. |
| |
| .. py:attribute:: conn_name_attr |
| :annotation: = azure_data_lake_conn_id |
| |
| |
| |
| .. py:attribute:: default_conn_name |
| :annotation: = azure_data_lake_default |
| |
| |
| |
| .. py:attribute:: conn_type |
| :annotation: = azure_data_lake |
| |
| |
| |
| .. py:attribute:: hook_name |
| :annotation: = Azure Data Lake |
| |
| |
| |
| .. py:method:: get_connection_form_widgets() |
| :staticmethod: |
| |
| Returns connection widgets to add to connection form |
| |
| |
| .. py:method:: get_ui_field_behaviour() |
| :staticmethod: |
| |
| Returns custom field behaviour |
| |
| |
| .. py:method:: get_conn(self) |
| |
| Return a AzureDLFileSystem object. |
| |
| |
| .. py:method:: check_for_file(self, file_path) |
| |
| Check if a file exists on Azure Data Lake. |
| |
| :param file_path: Path and name of the file. |
| :return: True if the file exists, False otherwise. |
| :rtype: bool |
| |
| |
| .. py:method:: upload_file(self, local_path, remote_path, nthreads = 64, overwrite = True, buffersize = 4194304, blocksize = 4194304, **kwargs) |
| |
| Upload a file to Azure Data Lake. |
| |
| :param local_path: local path. Can be single file, directory (in which case, |
| upload recursively) or glob pattern. Recursive glob patterns using `**` |
| are not supported. |
| :param remote_path: Remote path to upload to; if multiple files, this is the |
| directory root to write within. |
| :param nthreads: Number of threads to use. If None, uses the number of cores. |
| :param overwrite: Whether to forcibly overwrite existing files/directories. |
| If False and remote path is a directory, will quit regardless if any files |
| would be overwritten or not. If True, only matching filenames are actually |
| overwritten. |
| :param buffersize: int [2**22] |
| Number of bytes for internal buffer. This block cannot be bigger than |
| a chunk and cannot be smaller than a block. |
| :param blocksize: int [2**22] |
| Number of bytes for a block. Within each chunk, we write a smaller |
| block for each API call. This block cannot be bigger than a chunk. |
| |
| |
| .. py:method:: download_file(self, local_path, remote_path, nthreads = 64, overwrite = True, buffersize = 4194304, blocksize = 4194304, **kwargs) |
| |
| Download a file from Azure Blob Storage. |
| |
| :param local_path: local path. If downloading a single file, will write to this |
| specific file, unless it is an existing directory, in which case a file is |
| created within it. If downloading multiple files, this is the root |
| directory to write within. Will create directories as required. |
| :param remote_path: remote path/globstring to use to find remote files. |
| Recursive glob patterns using `**` are not supported. |
| :param nthreads: Number of threads to use. If None, uses the number of cores. |
| :param overwrite: Whether to forcibly overwrite existing files/directories. |
| If False and remote path is a directory, will quit regardless if any files |
| would be overwritten or not. If True, only matching filenames are actually |
| overwritten. |
| :param buffersize: int [2**22] |
| Number of bytes for internal buffer. This block cannot be bigger than |
| a chunk and cannot be smaller than a block. |
| :param blocksize: int [2**22] |
| Number of bytes for a block. Within each chunk, we write a smaller |
| block for each API call. This block cannot be bigger than a chunk. |
| |
| |
| .. py:method:: list(self, path) |
| |
| List files in Azure Data Lake Storage |
| |
| :param path: full path/globstring to use to list files in ADLS |
| |
| |
| .. py:method:: remove(self, path, recursive = False, ignore_not_found = True) |
| |
| Remove files in Azure Data Lake Storage |
| |
| :param path: A directory or file to remove in ADLS |
| :param recursive: Whether to loop into directories in the location and remove the files |
| :param ignore_not_found: Whether to raise error if file to delete is not found |
| |
| |
| |