:mod:`airflow.providers.google.cloud.hooks.gcs`
===============================================
.. py:module:: airflow.providers.google.cloud.hooks.gcs
.. autoapi-nested-parse::
This module contains a Google Cloud Storage hook.
Module Contents
---------------
.. data:: RT
.. data:: T
.. function:: _fallback_object_url_to_object_name_and_bucket_name(object_url_keyword_arg_name='object_url', bucket_name_keyword_arg_name='bucket_name', object_name_keyword_arg_name='object_name') -> Callable[[T], T]
Decorator factory that converts an object URL parameter into object name and bucket name parameters.
:param object_url_keyword_arg_name: Name of the object URL parameter
:type object_url_keyword_arg_name: str
:param bucket_name_keyword_arg_name: Name of the bucket name parameter
:type bucket_name_keyword_arg_name: str
:param object_name_keyword_arg_name: Name of the object name parameter
:type object_name_keyword_arg_name: str
:return: Decorator
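The following is a minimal sketch of how such a decorator factory could work, not the provider's actual implementation. The name ``fallback_object_url``, the error types, and the ``describe`` function are hypothetical and exist only for illustration.

```python
import functools
from urllib.parse import urlsplit


def fallback_object_url(url_kw="object_url", bucket_kw="bucket_name", object_kw="object_name"):
    """Hypothetical decorator factory: accept a gs:// URL in place of bucket/object names."""

    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            object_url = kwargs.pop(url_kw, None)
            if object_url is not None:
                # Assumption: mixing the URL with explicit names is treated as an error.
                if bucket_kw in kwargs or object_kw in kwargs:
                    raise TypeError(f"Pass either {url_kw} or {bucket_kw}/{object_kw}, not both")
                parts = urlsplit(object_url)
                if parts.scheme != "gs" or not parts.netloc:
                    raise ValueError(f"Expected gs://<bucket>/<object>, got {object_url!r}")
                # The URL host is the bucket; the path without its leading slash is the object.
                kwargs[bucket_kw] = parts.netloc
                kwargs[object_kw] = parts.path.lstrip("/")
            return func(*args, **kwargs)

        return wrapper

    return decorator


@fallback_object_url()
def describe(bucket_name=None, object_name=None):
    """Illustrative target function that only needs the two resolved names."""
    return f"{bucket_name}/{object_name}"
```

With this sketch, ``describe(object_url="gs://data/raw/a.csv")`` and ``describe(bucket_name="data", object_name="raw/a.csv")`` reach the wrapped function with the same keyword arguments.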
.. py:class:: GCSHook(gcp_conn_id: str = 'google_cloud_default', delegate_to: Optional[str] = None, google_cloud_storage_conn_id: Optional[str] = None, impersonation_chain: Optional[Union[str, Sequence[str]]] = None)
Bases: :class:`airflow.providers.google.common.hooks.base_google.GoogleBaseHook`
Interact with Google Cloud Storage. This hook uses the Google Cloud
connection.
.. attribute:: _conn
:annotation: :Optional[storage.Client]
.. method:: get_conn(self)
Returns a Google Cloud Storage service object.
.. method:: copy(self, source_bucket: str, source_object: str, destination_bucket: Optional[str] = None, destination_object: Optional[str] = None)
Copies an object from one bucket to another, renaming it if requested.
Either destination_bucket or destination_object can be omitted, in which case
the source bucket or object name is used, but not both.
:param source_bucket: The bucket of the object to copy from.
:type source_bucket: str
:param source_object: The object to copy.
:type source_object: str
:param destination_bucket: The bucket to copy the object to.
Can be omitted; then the same bucket is used.
:type destination_bucket: str
:param destination_object: The (renamed) path of the object if given.
Can be omitted; then the same name is used.
:type destination_object: str
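The defaulting rule for ``copy`` can be sketched in plain Python. This is an illustration of the rule as described above, not the hook's code; the function name ``resolve_copy_destination`` and the choice of ``ValueError`` are assumptions.

```python
def resolve_copy_destination(source_bucket, source_object,
                             destination_bucket=None, destination_object=None):
    """Return the effective (bucket, object) destination for a copy."""
    # An omitted destination part falls back to the matching source part.
    destination_bucket = destination_bucket or source_bucket
    destination_object = destination_object or source_object
    # Omitting both would make the destination identical to the source.
    if (destination_bucket, destination_object) == (source_bucket, source_object):
        raise ValueError("Either the destination bucket or the destination object must differ")
    return destination_bucket, destination_object
```

Calling ``resolve_copy_destination("src", "a.txt", destination_object="b.txt")`` describes a rename within the same bucket, while passing only ``destination_bucket`` keeps the object name.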
.. method:: rewrite(self, source_bucket: str, source_object: str, destination_bucket: str, destination_object: Optional[str] = None)
Has the same functionality as copy, except that it also works on files
over 5 TB, as well as when copying between locations and/or storage
classes.
destination_object can be omitted, in which case source_object is used.
:param source_bucket: The bucket of the object to copy from.
:type source_bucket: str
:param source_object: The object to copy.
:type source_object: str
:param destination_bucket: The bucket to copy the object to.
:type destination_bucket: str
:param destination_object: The (renamed) path of the object if given.
Can be omitted; then the same name is used.
:type destination_object: str
.. method:: download(self, object_name: str, bucket_name: Optional[str], filename: Optional[str] = None)
Downloads a file from Google Cloud Storage.
When no filename is supplied, this method loads the file into memory and returns its
content. When a filename is supplied, it writes the file to the specified location and
returns the location. For file sizes that exceed the available memory it is recommended
to write to a file.
:param bucket_name: The bucket to fetch from.
:type bucket_name: str
:param object_name: The object to fetch.
:type object_name: str
:param filename: If set, a local file path where the file should be written to.
:type filename: str
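A short usage sketch of the two download modes follows. The helper names ``read_text`` and ``save_locally`` are hypothetical; ``hook`` is assumed to be a ``GCSHook`` instance (or any object exposing the same ``download`` signature), and the in-memory content is assumed to arrive as ``bytes`` or ``str``.

```python
def read_text(hook, bucket_name, object_name, encoding="utf-8"):
    """Download an object into memory and return it as text."""
    data = hook.download(object_name=object_name, bucket_name=bucket_name)
    # Assumption: content may be bytes or str depending on the client version.
    return data.decode(encoding) if isinstance(data, bytes) else data


def save_locally(hook, bucket_name, object_name, filename):
    """Download an object to a local path; preferable for large objects."""
    return hook.download(object_name=object_name, bucket_name=bucket_name, filename=filename)
```

In a real task the hook would be created with something like ``GCSHook(gcp_conn_id="google_cloud_default")`` before calling these helpers.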
.. method:: provide_file(self, bucket_name: Optional[str] = None, object_name: Optional[str] = None, object_url: Optional[str] = None)
Downloads the file to a temporary directory and returns a file handle.
You can use this method by passing the bucket_name and object_name parameters
or just object_url parameter.
:param bucket_name: The bucket to fetch from.
:type bucket_name: str
:param object_name: The object to fetch.
:type object_name: str
:param object_url: File reference URL. Must start with "gs://".
:type object_url: str
:return: File handler
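A possible usage pattern is sketched below. It assumes ``provide_file`` is used as a context manager that yields a binary file handle and cleans up the temporary file on exit; the helper name ``count_lines`` is hypothetical.

```python
def count_lines(hook, object_url):
    """Count the lines of a GCS object via a temporary local copy."""
    with hook.provide_file(object_url=object_url) as handle:
        # Rewind defensively in case the handle is not positioned at the start.
        handle.seek(0)
        return sum(1 for _ in handle)
```

The same call can be made with ``bucket_name`` and ``object_name`` instead of ``object_url``.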
.. method:: provide_file_and_upload(self, bucket_name: Optional[str] = None, object_name: Optional[str] = None, object_url: Optional[str] = None)
Creates a temporary file, returns a file handle, and uploads the file's content
on close.
You can use this method by passing the bucket_name and object_name parameters
or just object_url parameter.
:param bucket_name: The bucket to upload to.
:type bucket_name: str
:param object_name: The object name to set when uploading the file.
:type object_name: str
:param object_url: File reference URL. Must start with "gs://".
:type object_url: str
:return: File handler
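The write-then-upload flow might look like the sketch below. It assumes the method is a context manager yielding a binary file handle and that the upload happens when the ``with`` block exits; ``write_lines`` is a hypothetical helper name.

```python
def write_lines(hook, bucket_name, object_name, lines, encoding="utf-8"):
    """Write text lines to a temporary file that is uploaded when the block exits."""
    with hook.provide_file_and_upload(bucket_name=bucket_name, object_name=object_name) as handle:
        for line in lines:
            # Assumption: the handle is opened in binary mode, so encode each line.
            handle.write(line.encode(encoding) + b"\n")
        # Flush so the buffered bytes are on disk before the upload step runs.
        handle.flush()
```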
.. method:: upload(self, bucket_name: str, object_name: str, filename: Optional[str] = None, data: Optional[Union[str, bytes]] = None, mime_type: Optional[str] = None, gzip: bool = False, encoding: str = 'utf-8')
Uploads a local file or file data as string or bytes to Google Cloud Storage.
:param bucket_name: The bucket to upload to.
:type bucket_name: str
:param object_name: The object name to set when uploading the file.
:type object_name: str
:param filename: The local file path to the file to be uploaded.
:type filename: str
:param data: The file's data as a string or bytes to be uploaded.
:type data: str or bytes
:param mime_type: The file's mime type set when uploading the file.
:type mime_type: str
:param gzip: Whether to compress the local file or file data before upload.
:type gzip: bool
:param encoding: Encoding used to convert the file data to bytes if it is provided as a string.
:type encoding: str
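As an illustration of uploading in-memory data rather than a local file, the sketch below serializes a dictionary to JSON and passes it through ``data``. The helper ``upload_json`` is hypothetical; ``hook`` is assumed to expose the ``upload`` signature documented above.

```python
import json


def upload_json(hook, bucket_name, object_name, payload, compress=False):
    """Serialize ``payload`` to JSON and upload it as an object."""
    hook.upload(
        bucket_name=bucket_name,
        object_name=object_name,
        # Pass string data instead of a filename; the hook encodes it using ``encoding``.
        data=json.dumps(payload, sort_keys=True),
        mime_type="application/json",
        gzip=compress,
    )
```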
.. method:: exists(self, bucket_name: str, object_name: str)
Checks for the existence of a file in Google Cloud Storage.
:param bucket_name: The Google Cloud Storage bucket where the object is.
:type bucket_name: str
:param object_name: The name of the object to check in the Google Cloud
Storage bucket.
:type object_name: str
.. method:: get_blob_update_time(self, bucket_name: str, object_name: str)
Gets the update time of a file in Google Cloud Storage.
:param bucket_name: The Google Cloud Storage bucket where the object is.
:type bucket_name: str
:param object_name: The name of the blob whose update time is returned, in the Google Cloud
Storage bucket.
:type object_name: str
.. method:: is_updated_after(self, bucket_name: str, object_name: str, ts: datetime)
Checks if an object was updated after the given time in Google Cloud Storage.
:param bucket_name: The Google Cloud Storage bucket where the object is.
:type bucket_name: str
:param object_name: The name of the object to check in the Google cloud
storage bucket.
:type object_name: str
:param ts: The timestamp to check against.
:type ts: datetime.datetime
.. method:: is_updated_between(self, bucket_name: str, object_name: str, min_ts: datetime, max_ts: datetime)
Checks if an object was updated between the given times in Google Cloud Storage.
:param bucket_name: The Google Cloud Storage bucket where the object is.
:type bucket_name: str
:param object_name: The name of the object to check in the Google cloud
storage bucket.
:type object_name: str
:param min_ts: The minimum timestamp to check against.
:type min_ts: datetime.datetime
:param max_ts: The maximum timestamp to check against.
:type max_ts: datetime.datetime
.. method:: is_updated_before(self, bucket_name: str, object_name: str, ts: datetime)
Checks if an object was updated before the given time in Google Cloud Storage.
:param bucket_name: The Google Cloud Storage bucket where the object is.
:type bucket_name: str
:param object_name: The name of the object to check in the Google cloud
storage bucket.
:type object_name: str
:param ts: The timestamp to check against.
:type ts: datetime.datetime
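The three timestamp checks above reduce to comparisons against the object's update time. The sketch below shows one way to express them; treating naive datetimes as UTC and making the upper bound of the range exclusive are assumptions of this example, and all function names are hypothetical.

```python
from datetime import datetime, timezone


def as_utc(ts):
    """Treat naive datetimes as UTC; leave timezone-aware ones unchanged."""
    return ts.replace(tzinfo=timezone.utc) if ts.tzinfo is None else ts


def updated_after(updated, ts):
    return as_utc(updated) > as_utc(ts)


def updated_before(updated, ts):
    return as_utc(updated) < as_utc(ts)


def updated_between(updated, min_ts, max_ts):
    # Assumption: lower bound inclusive, upper bound exclusive.
    return as_utc(min_ts) <= as_utc(updated) < as_utc(max_ts)
```

Normalizing both sides first avoids the ``TypeError`` Python raises when a naive datetime is compared with an aware one.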
.. method:: is_older_than(self, bucket_name: str, object_name: str, seconds: int)
Checks if an object is older than the given time.
:param bucket_name: The Google Cloud Storage bucket where the object is.
:type bucket_name: str
:param object_name: The name of the object to check in the Google cloud
storage bucket.
:type object_name: str
:param seconds: The time in seconds to check against
:type seconds: int
.. method:: delete(self, bucket_name: str, object_name: str)
Deletes an object from the bucket.
:param bucket_name: name of the bucket, where the object resides
:type bucket_name: str
:param object_name: name of the object to delete
:type object_name: str
.. method:: delete_bucket(self, bucket_name: str, force: bool = False)
Deletes a bucket from Google Cloud Storage.
:param bucket_name: name of the bucket which will be deleted
:type bucket_name: str
:param force: If False, a non-empty bucket cannot be deleted; set force=True
to allow deleting a non-empty bucket.
:type force: bool
.. method:: list(self, bucket_name, versions=None, max_results=None, prefix=None, delimiter=None)
Lists all objects from the bucket with the given string prefix in the name.
:param bucket_name: bucket name
:type bucket_name: str
:param versions: if true, list all versions of the objects
:type versions: bool
:param max_results: max count of items to return in a single page of responses
:type max_results: int
:param prefix: prefix string which filters objects whose names begin with
this prefix
:type prefix: str
:param delimiter: filters objects based on the delimiter (e.g. '.csv')
:type delimiter: str
:return: a stream of object names matching the filtering criteria
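A small usage sketch for ``list`` follows. It narrows the listing server-side with ``prefix`` and then filters by suffix on the client, which sidesteps any ambiguity in delimiter handling; this is a design choice of the example, and ``list_with_suffix`` is a hypothetical helper.

```python
def list_with_suffix(hook, bucket_name, prefix, suffix):
    """Return sorted object names under ``prefix`` that end with ``suffix``."""
    names = hook.list(bucket_name, prefix=prefix)
    # Client-side suffix filter; the prefix filter is applied by the listing call.
    return sorted(name for name in names if name.endswith(suffix))
```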
.. method:: get_size(self, bucket_name: str, object_name: str)
Gets the size of a file in Google Cloud Storage.
:param bucket_name: The Google Cloud Storage bucket where the object is.
:type bucket_name: str
:param object_name: The name of the object to check in the Google
Cloud Storage bucket.
:type object_name: str
.. method:: get_crc32c(self, bucket_name: str, object_name: str)
Gets the CRC32c checksum of an object in Google Cloud Storage.
:param bucket_name: The Google Cloud Storage bucket where the object is.
:type bucket_name: str
:param object_name: The name of the object to check in the Google Cloud
Storage bucket.
:type object_name: str
.. method:: get_md5hash(self, bucket_name: str, object_name: str)
Gets the MD5 hash of an object in Google Cloud Storage.
:param bucket_name: The Google Cloud Storage bucket where the object is.
:type bucket_name: str
:param object_name: The name of the object to check in the Google Cloud
Storage bucket.
:type object_name: str
.. method:: create_bucket(self, bucket_name: str, resource: Optional[dict] = None, storage_class: str = 'MULTI_REGIONAL', location: str = 'US', project_id: Optional[str] = None, labels: Optional[dict] = None)
Creates a new bucket. Google Cloud Storage uses a flat namespace, so
you can't create a bucket with a name that is already in use.
.. seealso::
For more information, see Bucket Naming Guidelines:
https://cloud.google.com/storage/docs/bucketnaming.html#requirements
:param bucket_name: The name of the bucket.
:type bucket_name: str
:param resource: An optional dict with parameters for creating the bucket.
For information on available parameters, see Cloud Storage API doc:
https://cloud.google.com/storage/docs/json_api/v1/buckets/insert
:type resource: dict
:param storage_class: This defines how objects in the bucket are stored
and determines the SLA and the cost of storage. Values include
- ``MULTI_REGIONAL``
- ``REGIONAL``
- ``STANDARD``
- ``NEARLINE``
- ``COLDLINE``.
If this value is not specified when the bucket is
created, it will default to STANDARD.
:type storage_class: str
:param location: The location of the bucket.
Object data for objects in the bucket resides in physical storage
within this region. Defaults to US.
.. seealso::
https://developers.google.com/storage/docs/bucket-locations
:type location: str
:param project_id: The ID of the Google Cloud Project.
:type project_id: str
:param labels: User-provided labels, in key/value pairs.
:type labels: dict
:return: If successful, it returns the ``id`` of the bucket.
.. method:: insert_bucket_acl(self, bucket_name: str, entity: str, role: str, user_project: Optional[str] = None)
Creates a new ACL entry on the specified bucket.
See: https://cloud.google.com/storage/docs/json_api/v1/bucketAccessControls/insert
:param bucket_name: Name of the bucket.
:type bucket_name: str
:param entity: The entity holding the permission, in one of the following forms:
user-userId, user-email, group-groupId, group-email, domain-domain,
project-team-projectId, allUsers, allAuthenticatedUsers.
See: https://cloud.google.com/storage/docs/access-control/lists#scopes
:type entity: str
:param role: The access permission for the entity.
Acceptable values are: "OWNER", "READER", "WRITER".
:type role: str
:param user_project: (Optional) The project to be billed for this request.
Required for Requester Pays buckets.
:type user_project: str
.. method:: insert_object_acl(self, bucket_name: str, object_name: str, entity: str, role: str, generation: Optional[int] = None, user_project: Optional[str] = None)
Creates a new ACL entry on the specified object.
See: https://cloud.google.com/storage/docs/json_api/v1/objectAccessControls/insert
:param bucket_name: Name of the bucket.
:type bucket_name: str
:param object_name: Name of the object. For information about how to URL encode
object names to be path safe, see:
https://cloud.google.com/storage/docs/json_api/#encoding
:type object_name: str
:param entity: The entity holding the permission, in one of the following forms:
user-userId, user-email, group-groupId, group-email, domain-domain,
project-team-projectId, allUsers, allAuthenticatedUsers.
See: https://cloud.google.com/storage/docs/access-control/lists#scopes
:type entity: str
:param role: The access permission for the entity.
Acceptable values are: "OWNER", "READER".
:type role: str
:param generation: Optional. If present, selects a specific revision of this object.
:type generation: int
:param user_project: (Optional) The project to be billed for this request.
Required for Requester Pays buckets.
:type user_project: str
.. method:: compose(self, bucket_name: str, source_objects: List, destination_object: str)
Composes a list of existing objects into a new object in the same storage bucket.
Currently, at most 32 objects can be concatenated in a single operation.
See: https://cloud.google.com/storage/docs/json_api/v1/objects/compose
:param bucket_name: The name of the bucket containing the source objects.
This is also the same bucket to store the composed destination object.
:type bucket_name: str
:param source_objects: The list of source objects that will be composed
into a single object.
:type source_objects: list
:param destination_object: The path of the composed destination object.
:type destination_object: str
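Because a single compose call accepts at most 32 sources, longer lists have to be composed in several steps. The sketch below shows one possible approach: compose the first 32 sources, then repeatedly append up to 31 more by using the destination itself as the first source. It assumes the destination may also appear as a source of a later compose call; ``compose_many`` is a hypothetical helper, not part of the hook.

```python
MAX_COMPOSE_SOURCES = 32  # per-call limit stated in the method description


def compose_many(hook, bucket_name, source_objects, destination_object):
    """Compose any number of sources into one object; return the number of calls made."""
    if not source_objects:
        raise ValueError("source_objects must not be empty")
    first = source_objects[:MAX_COMPOSE_SOURCES]
    rest = source_objects[MAX_COMPOSE_SOURCES:]
    hook.compose(bucket_name=bucket_name, source_objects=first,
                 destination_object=destination_object)
    calls = 1
    # Each later call spends one slot on the partial result, leaving 31 for new sources.
    step = MAX_COMPOSE_SOURCES - 1
    for start in range(0, len(rest), step):
        batch = [destination_object] + rest[start:start + step]
        hook.compose(bucket_name=bucket_name, source_objects=batch,
                     destination_object=destination_object)
        calls += 1
    return calls
```

Source order is preserved because every batch places the partial result first.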
.. method:: sync(self, source_bucket: str, destination_bucket: str, source_object: Optional[str] = None, destination_object: Optional[str] = None, recursive: bool = True, allow_overwrite: bool = False, delete_extra_files: bool = False)
Synchronizes the contents of the buckets.
Parameters ``source_object`` and ``destination_object`` describe the root sync directories. If they
are not passed, the entire bucket will be synchronized. If they are passed, they should point
to directories.
.. note::
The synchronization of individual files is not supported. Only entire directories can be
synchronized.
:param source_bucket: The name of the bucket containing the source objects.
:type source_bucket: str
:param destination_bucket: The name of the bucket containing the destination objects.
:type destination_bucket: str
:param source_object: The root sync directory in the source bucket.
:type source_object: Optional[str]
:param destination_object: The root sync directory in the destination bucket.
:type destination_object: Optional[str]
:param recursive: If True, subdirectories will be considered
:type recursive: bool
:param allow_overwrite: If True, files are overwritten when a mismatched file is found.
By default, overwriting files is not allowed.
:type allow_overwrite: bool
:param delete_extra_files: If True, deletes files in the destination that are not found in the
source. By default, extra files are not deleted.
.. note::
This option can delete data quickly if you specify the wrong source/destination combination.
:type delete_extra_files: bool
:return: none
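The effect of ``allow_overwrite`` and ``delete_extra_files`` can be pictured as a plan computed from the two directory listings. The sketch below is an illustration only, not the hook's implementation: ``plan_sync`` is a hypothetical function, listings are modeled as dictionaries mapping relative paths to checksums, and "extra files" are taken to mean objects present in the destination but absent from the source.

```python
def plan_sync(source, destination, allow_overwrite=False, delete_extra_files=False):
    """Return which relative paths would be copied, overwritten, skipped, or deleted."""
    src_names, dst_names = set(source), set(destination)
    # Objects missing from the destination are always copied.
    to_copy = sorted(src_names - dst_names)
    # Objects present on both sides with different checksums are mismatched.
    mismatched = sorted(n for n in src_names & dst_names if source[n] != destination[n])
    to_overwrite = mismatched if allow_overwrite else []
    skipped = [] if allow_overwrite else mismatched
    # Destination-only objects are removed only when explicitly requested.
    to_delete = sorted(dst_names - src_names) if delete_extra_files else []
    return {"copy": to_copy, "overwrite": to_overwrite, "skipped": skipped, "delete": to_delete}
```

Reviewing such a plan before enabling ``delete_extra_files`` is one way to guard against a swapped source and destination.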
.. method:: _calculate_sync_destination_path(self, blob: storage.Blob, destination_object: Optional[str], source_object_prefix_len: int)
.. method:: _normalize_directory_path(self, source_object: Optional[str])
.. staticmethod:: _prepare_sync_plan(source_bucket: storage.Bucket, destination_bucket: storage.Bucket, source_object: Optional[str], destination_object: Optional[str], recursive: bool)
.. function:: gcs_object_is_directory(bucket: str) -> bool
Return True if given Google Cloud Storage URL (gs://<bucket>/<blob>)
is a directory or an empty bucket. Otherwise return False.
.. function:: _parse_gcs_url(gsurl: str) -> Tuple[str, str]
Given a Google Cloud Storage URL (gs://<bucket>/<blob>), returns a
tuple containing the corresponding bucket and blob.
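The two module-level helpers can be approximated with the standard library. The sketch below is an assumption-laden illustration rather than the module's code: ``split_gcs_url`` and ``looks_like_directory`` are hypothetical names, and a trailing slash is assumed to be what marks a directory.

```python
from urllib.parse import urlsplit


def split_gcs_url(gsurl):
    """Split gs://<bucket>/<blob> into a (bucket, blob) tuple."""
    parts = urlsplit(gsurl)
    if parts.scheme != "gs" or not parts.netloc:
        raise ValueError(f"Expected gs://<bucket>/<blob>, got {gsurl!r}")
    # Caveat: '?' or '#' in an object name would be parsed as query or fragment.
    return parts.netloc, parts.path.lstrip("/")


def looks_like_directory(gsurl):
    """True for a bare bucket URL or a blob path ending in '/'."""
    _, blob = split_gcs_url(gsurl)
    return blob == "" or blob.endswith("/")
```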