blob: aae46eb8208fb8d41f35deb24d9d67782845f272 [file] [log] [blame]
:py:mod:`airflow.providers.google.cloud.operators.gcs`
======================================================
.. py:module:: airflow.providers.google.cloud.operators.gcs
.. autoapi-nested-parse::
This module contains a Google Cloud Storage Bucket operator.
Module Contents
---------------
Classes
~~~~~~~
.. autoapisummary::
airflow.providers.google.cloud.operators.gcs.GCSCreateBucketOperator
airflow.providers.google.cloud.operators.gcs.GCSListObjectsOperator
airflow.providers.google.cloud.operators.gcs.GCSDeleteObjectsOperator
airflow.providers.google.cloud.operators.gcs.GCSBucketCreateAclEntryOperator
airflow.providers.google.cloud.operators.gcs.GCSObjectCreateAclEntryOperator
airflow.providers.google.cloud.operators.gcs.GCSFileTransformOperator
airflow.providers.google.cloud.operators.gcs.GCSTimeSpanFileTransformOperator
airflow.providers.google.cloud.operators.gcs.GCSDeleteBucketOperator
airflow.providers.google.cloud.operators.gcs.GCSSynchronizeBucketsOperator
.. py:class:: GCSCreateBucketOperator(*, bucket_name, resource = None, storage_class = 'MULTI_REGIONAL', location = 'US', project_id = None, labels = None, gcp_conn_id = 'google_cloud_default', delegate_to = None, impersonation_chain = None, **kwargs)
Bases: :py:obj:`airflow.models.BaseOperator`
Creates a new bucket. Google Cloud Storage uses a flat namespace,
so you can't create a bucket with a name that is already in use.
.. seealso::
For more information, see Bucket Naming Guidelines:
https://cloud.google.com/storage/docs/bucketnaming.html#requirements
:param bucket_name: The name of the bucket. (templated)
:param resource: An optional dict with parameters for creating the bucket.
For information on available parameters, see Cloud Storage API doc:
https://cloud.google.com/storage/docs/json_api/v1/buckets/insert
:param storage_class: This defines how objects in the bucket are stored
and determines the SLA and the cost of storage (templated). Values include
- ``MULTI_REGIONAL``
- ``REGIONAL``
- ``STANDARD``
- ``NEARLINE``
- ``COLDLINE``.
If this value is not specified when the bucket is
created, it will default to STANDARD.
:param location: The location of the bucket. (templated)
Object data for objects in the bucket resides in physical storage
within this region. Defaults to US.
.. seealso:: https://developers.google.com/storage/docs/bucket-locations
:param project_id: The ID of the Google Cloud Project. (templated)
:param labels: User-provided labels, in key/value pairs.
:param gcp_conn_id: (Optional) The connection ID used to connect to Google Cloud.
:param delegate_to: The account to impersonate using domain-wide delegation of authority,
if any. For this to work, the service account making the request must have
domain-wide delegation enabled.
:param impersonation_chain: Optional service account to impersonate using short-term
credentials, or chained list of accounts required to get the access_token
of the last account in the list, which will be impersonated in the request.
If set as a string, the account must grant the originating account
the Service Account Token Creator IAM role.
If set as a sequence, the identities from the list must grant
Service Account Token Creator IAM role to the directly preceding identity, with first
account from the list granting this role to the originating account (templated).
The following Operator would create a new bucket ``test-bucket``
with ``MULTI_REGIONAL`` storage class in ``EU`` region
.. code-block:: python
CreateBucket = GoogleCloudStorageCreateBucketOperator(
task_id="CreateNewBucket",
bucket_name="test-bucket",
storage_class="MULTI_REGIONAL",
location="EU",
labels={"env": "dev", "team": "airflow"},
gcp_conn_id="airflow-conn-id",
)
.. py:attribute:: template_fields
:annotation: :Sequence[str] = ['bucket_name', 'storage_class', 'location', 'project_id', 'impersonation_chain']
.. py:attribute:: ui_color
:annotation: = #f0eee4
.. py:attribute:: operator_extra_links
.. py:method:: execute(self, context)
This is the main method to derive when creating an operator.
Context is the same dictionary used as when rendering jinja templates.
Refer to get_template_context for more context.
.. py:class:: GCSListObjectsOperator(*, bucket, prefix = None, delimiter = None, gcp_conn_id = 'google_cloud_default', delegate_to = None, impersonation_chain = None, **kwargs)
Bases: :py:obj:`airflow.models.BaseOperator`
List all objects from the bucket with the given string prefix and delimiter in name.
This operator returns a python list with the name of objects which can be used by
`xcom` in the downstream task.
:param bucket: The Google Cloud Storage bucket to find the objects. (templated)
:param prefix: Prefix string which filters objects whose name begin with
this prefix. (templated)
:param delimiter: The delimiter by which you want to filter the objects. (templated)
For e.g to lists the CSV files from in a directory in GCS you would use
delimiter='.csv'.
:param gcp_conn_id: (Optional) The connection ID used to connect to Google Cloud.
:param delegate_to: The account to impersonate using domain-wide delegation of authority,
if any. For this to work, the service account making the request must have
domain-wide delegation enabled.
:param impersonation_chain: Optional service account to impersonate using short-term
credentials, or chained list of accounts required to get the access_token
of the last account in the list, which will be impersonated in the request.
If set as a string, the account must grant the originating account
the Service Account Token Creator IAM role.
If set as a sequence, the identities from the list must grant
Service Account Token Creator IAM role to the directly preceding identity, with first
account from the list granting this role to the originating account (templated).
**Example**:
The following Operator would list all the Avro files from ``sales/sales-2017``
folder in ``data`` bucket. ::
GCS_Files = GoogleCloudStorageListOperator(
task_id='GCS_Files',
bucket='data',
prefix='sales/sales-2017/',
delimiter='.avro',
gcp_conn_id=google_cloud_conn_id
)
.. py:attribute:: template_fields
:annotation: :Sequence[str] = ['bucket', 'prefix', 'delimiter', 'impersonation_chain']
.. py:attribute:: ui_color
:annotation: = #f0eee4
.. py:attribute:: operator_extra_links
.. py:method:: execute(self, context)
This is the main method to derive when creating an operator.
Context is the same dictionary used as when rendering jinja templates.
Refer to get_template_context for more context.
.. py:class:: GCSDeleteObjectsOperator(*, bucket_name, objects = None, prefix = None, gcp_conn_id = 'google_cloud_default', delegate_to = None, impersonation_chain = None, **kwargs)
Bases: :py:obj:`airflow.models.BaseOperator`
Deletes objects from a Google Cloud Storage bucket, either
from an explicit list of object names or all objects
matching a prefix.
:param bucket_name: The GCS bucket to delete from
:param objects: List of objects to delete. These should be the names
of objects in the bucket, not including gs://bucket/
:param prefix: Prefix of objects to delete. All objects matching this
prefix in the bucket will be deleted.
:param gcp_conn_id: (Optional) The connection ID used to connect to Google Cloud.
:param delegate_to: The account to impersonate using domain-wide delegation of authority,
if any. For this to work, the service account making the request must have
domain-wide delegation enabled.
:param impersonation_chain: Optional service account to impersonate using short-term
credentials, or chained list of accounts required to get the access_token
of the last account in the list, which will be impersonated in the request.
If set as a string, the account must grant the originating account
the Service Account Token Creator IAM role.
If set as a sequence, the identities from the list must grant
Service Account Token Creator IAM role to the directly preceding identity, with first
account from the list granting this role to the originating account (templated).
.. py:attribute:: template_fields
:annotation: :Sequence[str] = ['bucket_name', 'prefix', 'objects', 'impersonation_chain']
.. py:method:: execute(self, context)
This is the main method to derive when creating an operator.
Context is the same dictionary used as when rendering jinja templates.
Refer to get_template_context for more context.
.. py:class:: GCSBucketCreateAclEntryOperator(*, bucket, entity, role, user_project = None, gcp_conn_id = 'google_cloud_default', impersonation_chain = None, **kwargs)
Bases: :py:obj:`airflow.models.BaseOperator`
Creates a new ACL entry on the specified bucket.
.. seealso::
For more information on how to use this operator, take a look at the guide:
:ref:`howto/operator:GCSBucketCreateAclEntryOperator`
:param bucket: Name of a bucket.
:param entity: The entity holding the permission, in one of the following forms:
user-userId, user-email, group-groupId, group-email, domain-domain,
project-team-projectId, allUsers, allAuthenticatedUsers
:param role: The access permission for the entity.
Acceptable values are: "OWNER", "READER", "WRITER".
:param user_project: (Optional) The project to be billed for this request.
Required for Requester Pays buckets.
:param gcp_conn_id: (Optional) The connection ID used to connect to Google Cloud.
:param impersonation_chain: Optional service account to impersonate using short-term
credentials, or chained list of accounts required to get the access_token
of the last account in the list, which will be impersonated in the request.
If set as a string, the account must grant the originating account
the Service Account Token Creator IAM role.
If set as a sequence, the identities from the list must grant
Service Account Token Creator IAM role to the directly preceding identity, with first
account from the list granting this role to the originating account (templated).
.. py:attribute:: template_fields
:annotation: :Sequence[str] = ['bucket', 'entity', 'role', 'user_project', 'impersonation_chain']
.. py:attribute:: operator_extra_links
.. py:method:: execute(self, context)
This is the main method to derive when creating an operator.
Context is the same dictionary used as when rendering jinja templates.
Refer to get_template_context for more context.
.. py:class:: GCSObjectCreateAclEntryOperator(*, bucket, object_name, entity, role, generation = None, user_project = None, gcp_conn_id = 'google_cloud_default', impersonation_chain = None, **kwargs)
Bases: :py:obj:`airflow.models.BaseOperator`
Creates a new ACL entry on the specified object.
.. seealso::
For more information on how to use this operator, take a look at the guide:
:ref:`howto/operator:GCSObjectCreateAclEntryOperator`
:param bucket: Name of a bucket.
:param object_name: Name of the object. For information about how to URL encode object
names to be path safe, see:
https://cloud.google.com/storage/docs/json_api/#encoding
:param entity: The entity holding the permission, in one of the following forms:
user-userId, user-email, group-groupId, group-email, domain-domain,
project-team-projectId, allUsers, allAuthenticatedUsers
:param role: The access permission for the entity.
Acceptable values are: "OWNER", "READER".
:param generation: Optional. If present, selects a specific revision of this object.
:param user_project: (Optional) The project to be billed for this request.
Required for Requester Pays buckets.
:param gcp_conn_id: (Optional) The connection ID used to connect to Google Cloud.
:param impersonation_chain: Optional service account to impersonate using short-term
credentials, or chained list of accounts required to get the access_token
of the last account in the list, which will be impersonated in the request.
If set as a string, the account must grant the originating account
the Service Account Token Creator IAM role.
If set as a sequence, the identities from the list must grant
Service Account Token Creator IAM role to the directly preceding identity, with first
account from the list granting this role to the originating account (templated).
.. py:attribute:: template_fields
:annotation: :Sequence[str] = ['bucket', 'object_name', 'entity', 'generation', 'role', 'user_project', 'impersonation_chain']
.. py:attribute:: operator_extra_links
.. py:method:: execute(self, context)
This is the main method to derive when creating an operator.
Context is the same dictionary used as when rendering jinja templates.
Refer to get_template_context for more context.
.. py:class:: GCSFileTransformOperator(*, source_bucket, source_object, transform_script, destination_bucket = None, destination_object = None, gcp_conn_id = 'google_cloud_default', impersonation_chain = None, **kwargs)
Bases: :py:obj:`airflow.models.BaseOperator`
Copies data from a source GCS location to a temporary location on the
local filesystem. Runs a transformation on this file as specified by
the transformation script and uploads the output to a destination bucket.
If the output bucket is not specified the original file will be
overwritten.
The locations of the source and the destination files in the local
filesystem is provided as an first and second arguments to the
transformation script. The transformation script is expected to read the
data from source, transform it and write the output to the local
destination file.
:param source_bucket: The bucket to locate the source_object. (templated)
:param source_object: The key to be retrieved from GCS. (templated)
:param destination_bucket: The bucket to upload the key after transformation.
If not provided, source_bucket will be used. (templated)
:param destination_object: The key to be written in GCS.
If not provided, source_object will be used. (templated)
:param transform_script: location of the executable transformation script or list of arguments
passed to subprocess ex. `['python', 'script.py', 10]`. (templated)
:param gcp_conn_id: The connection ID to use connecting to Google Cloud.
:param impersonation_chain: Optional service account to impersonate using short-term
credentials, or chained list of accounts required to get the access_token
of the last account in the list, which will be impersonated in the request.
If set as a string, the account must grant the originating account
the Service Account Token Creator IAM role.
If set as a sequence, the identities from the list must grant
Service Account Token Creator IAM role to the directly preceding identity, with first
account from the list granting this role to the originating account (templated).
.. py:attribute:: template_fields
:annotation: :Sequence[str] = ['source_bucket', 'source_object', 'destination_bucket', 'destination_object',...
.. py:attribute:: operator_extra_links
.. py:method:: execute(self, context)
This is the main method to derive when creating an operator.
Context is the same dictionary used as when rendering jinja templates.
Refer to get_template_context for more context.
.. py:class:: GCSTimeSpanFileTransformOperator(*, source_bucket, source_prefix, source_gcp_conn_id, destination_bucket, destination_prefix, destination_gcp_conn_id, transform_script, source_impersonation_chain = None, destination_impersonation_chain = None, chunk_size = None, download_continue_on_fail = False, download_num_attempts = 1, upload_continue_on_fail = False, upload_num_attempts = 1, **kwargs)
Bases: :py:obj:`airflow.models.BaseOperator`
Determines a list of objects that were added or modified at a GCS source
location during a specific time-span, copies them to a temporary location
on the local file system, runs a transform on this file as specified by
the transformation script and uploads the output to the destination bucket.
.. seealso::
For more information on how to use this operator, take a look at the guide:
:ref:`howto/operator:GCSTimeSpanFileTransformOperator`
The locations of the source and the destination files in the local
filesystem is provided as an first and second arguments to the
transformation script. The time-span is passed to the transform script as
third and fourth argument as UTC ISO 8601 string.
The transformation script is expected to read the
data from source, transform it and write the output to the local
destination file.
:param source_bucket: The bucket to fetch data from. (templated)
:param source_prefix: Prefix string which filters objects whose name begin with
this prefix. Can interpolate execution date and time components. (templated)
:param source_gcp_conn_id: The connection ID to use connecting to Google Cloud
to download files to be processed.
:param source_impersonation_chain: Optional service account to impersonate using short-term
credentials (to download files to be processed), or chained list of accounts required to
get the access_token of the last account in the list, which will be impersonated in the
request. If set as a string, the account must grant the originating account
the Service Account Token Creator IAM role.
If set as a sequence, the identities from the list must grant
Service Account Token Creator IAM role to the directly preceding identity, with first
account from the list granting this role to the originating account (templated).
:param destination_bucket: The bucket to write data to. (templated)
:param destination_prefix: Prefix string for the upload location.
Can interpolate execution date and time components. (templated)
:param destination_gcp_conn_id: The connection ID to use connecting to Google Cloud
to upload processed files.
:param destination_impersonation_chain: Optional service account to impersonate using short-term
credentials (to upload processed files), or chained list of accounts required to get the access_token
of the last account in the list, which will be impersonated in the request.
If set as a string, the account must grant the originating account
the Service Account Token Creator IAM role.
If set as a sequence, the identities from the list must grant
Service Account Token Creator IAM role to the directly preceding identity, with first
account from the list granting this role to the originating account (templated).
:param transform_script: location of the executable transformation script or list of arguments
passed to subprocess ex. `['python', 'script.py', 10]`. (templated)
:param chunk_size: The size of a chunk of data when downloading or uploading (in bytes).
This must be a multiple of 256 KB (per the google clout storage API specification).
:param download_continue_on_fail: With this set to true, if a download fails the task does not error out
but will still continue.
:param upload_chunk_size: The size of a chunk of data when uploading (in bytes).
This must be a multiple of 256 KB (per the google clout storage API specification).
:param upload_continue_on_fail: With this set to true, if an upload fails the task does not error out
but will still continue.
:param upload_num_attempts: Number of attempts to try to upload a single file.
.. py:attribute:: template_fields
:annotation: :Sequence[str] = ['source_bucket', 'source_prefix', 'destination_bucket', 'destination_prefix',...
.. py:attribute:: operator_extra_links
.. py:method:: interpolate_prefix(prefix, dt)
:staticmethod:
Interpolate prefix with datetime.
:param prefix: The prefix to interpolate
:param dt: The datetime to interpolate
.. py:method:: execute(self, context)
This is the main method to derive when creating an operator.
Context is the same dictionary used as when rendering jinja templates.
Refer to get_template_context for more context.
.. py:class:: GCSDeleteBucketOperator(*, bucket_name, force = True, gcp_conn_id = 'google_cloud_default', impersonation_chain = None, **kwargs)
Bases: :py:obj:`airflow.models.BaseOperator`
Deletes bucket from a Google Cloud Storage.
.. seealso::
For more information on how to use this operator, take a look at the guide:
:ref:`howto/operator:GCSDeleteBucketOperator`
:param bucket_name: name of the bucket which will be deleted
:param force: false not allow to delete non empty bucket, set force=True
allows to delete non empty bucket
:param gcp_conn_id: The connection ID to use connecting to Google Cloud.
:param impersonation_chain: Optional service account to impersonate using short-term
credentials, or chained list of accounts required to get the access_token
of the last account in the list, which will be impersonated in the request.
If set as a string, the account must grant the originating account
the Service Account Token Creator IAM role.
If set as a sequence, the identities from the list must grant
Service Account Token Creator IAM role to the directly preceding identity, with first
account from the list granting this role to the originating account (templated).
.. py:attribute:: template_fields
:annotation: :Sequence[str] = ['bucket_name', 'gcp_conn_id', 'impersonation_chain']
.. py:method:: execute(self, context)
This is the main method to derive when creating an operator.
Context is the same dictionary used as when rendering jinja templates.
Refer to get_template_context for more context.
.. py:class:: GCSSynchronizeBucketsOperator(*, source_bucket, destination_bucket, source_object = None, destination_object = None, recursive = True, delete_extra_files = False, allow_overwrite = False, gcp_conn_id = 'google_cloud_default', delegate_to = None, impersonation_chain = None, **kwargs)
Bases: :py:obj:`airflow.models.BaseOperator`
Synchronizes the contents of the buckets or bucket's directories in the Google Cloud Services.
Parameters ``source_object`` and ``destination_object`` describe the root sync directory. If they are
not passed, the entire bucket will be synchronized. They should point to directories.
.. note::
The synchronization of individual files is not supported. Only entire directories can be
synchronized.
.. seealso::
For more information on how to use this operator, take a look at the guide:
:ref:`howto/operator:GCSSynchronizeBuckets`
:param source_bucket: The name of the bucket containing the source objects.
:param destination_bucket: The name of the bucket containing the destination objects.
:param source_object: The root sync directory in the source bucket.
:param destination_object: The root sync directory in the destination bucket.
:param recursive: If True, subdirectories will be considered
:param allow_overwrite: if True, the files will be overwritten if a mismatched file is found.
By default, overwriting files is not allowed
:param delete_extra_files: if True, deletes additional files from the source that not found in the
destination. By default extra files are not deleted.
.. note::
This option can delete data quickly if you specify the wrong source/destination combination.
:param gcp_conn_id: (Optional) The connection ID used to connect to Google Cloud.
:param delegate_to: The account to impersonate using domain-wide delegation of authority,
if any. For this to work, the service account making the request must have
domain-wide delegation enabled.
:param impersonation_chain: Optional service account to impersonate using short-term
credentials, or chained list of accounts required to get the access_token
of the last account in the list, which will be impersonated in the request.
If set as a string, the account must grant the originating account
the Service Account Token Creator IAM role.
If set as a sequence, the identities from the list must grant
Service Account Token Creator IAM role to the directly preceding identity, with first
account from the list granting this role to the originating account (templated).
.. py:attribute:: template_fields
:annotation: :Sequence[str] = ['source_bucket', 'destination_bucket', 'source_object', 'destination_object', 'recursive',...
.. py:attribute:: operator_extra_links
.. py:method:: execute(self, context)
This is the main method to derive when creating an operator.
Context is the same dictionary used as when rendering jinja templates.
Refer to get_template_context for more context.