| :py:mod:`airflow.providers.google.cloud.operators.gcs` |
| ====================================================== |
| |
| .. py:module:: airflow.providers.google.cloud.operators.gcs |
| |
| .. autoapi-nested-parse:: |
| |
| This module contains Google Cloud Storage operators. |
| |
| |
| |
| Module Contents |
| --------------- |
| |
| Classes |
| ~~~~~~~ |
| |
| .. autoapisummary:: |
| |
| airflow.providers.google.cloud.operators.gcs.GCSCreateBucketOperator |
| airflow.providers.google.cloud.operators.gcs.GCSListObjectsOperator |
| airflow.providers.google.cloud.operators.gcs.GCSDeleteObjectsOperator |
| airflow.providers.google.cloud.operators.gcs.GCSBucketCreateAclEntryOperator |
| airflow.providers.google.cloud.operators.gcs.GCSObjectCreateAclEntryOperator |
| airflow.providers.google.cloud.operators.gcs.GCSFileTransformOperator |
| airflow.providers.google.cloud.operators.gcs.GCSTimeSpanFileTransformOperator |
| airflow.providers.google.cloud.operators.gcs.GCSDeleteBucketOperator |
| airflow.providers.google.cloud.operators.gcs.GCSSynchronizeBucketsOperator |
| |
| |
| |
| |
| .. py:class:: GCSCreateBucketOperator(*, bucket_name, resource = None, storage_class = 'MULTI_REGIONAL', location = 'US', project_id = None, labels = None, gcp_conn_id = 'google_cloud_default', delegate_to = None, impersonation_chain = None, **kwargs) |
| |
| Bases: :py:obj:`airflow.models.BaseOperator` |
| |
| Creates a new bucket. Google Cloud Storage uses a flat namespace, |
| so you can't create a bucket with a name that is already in use. |
| |
| .. seealso:: |
| For more information, see Bucket Naming Guidelines: |
| https://cloud.google.com/storage/docs/bucketnaming.html#requirements |
| |
| :param bucket_name: The name of the bucket. (templated) |
| :param resource: An optional dict with parameters for creating the bucket. |
| For information on available parameters, see Cloud Storage API doc: |
| https://cloud.google.com/storage/docs/json_api/v1/buckets/insert |
| :param storage_class: This defines how objects in the bucket are stored |
| and determines the SLA and the cost of storage (templated). Values include |
| |
| - ``MULTI_REGIONAL`` |
| - ``REGIONAL`` |
| - ``STANDARD`` |
| - ``NEARLINE`` |
| - ``COLDLINE``. |
| |
| If this value is not specified, the operator uses its signature default, |
| ``MULTI_REGIONAL``. |
| :param location: The location of the bucket. (templated) |
| Object data for objects in the bucket resides in physical storage |
| within this region. Defaults to US. |
| |
| .. seealso:: https://developers.google.com/storage/docs/bucket-locations |
| |
| :param project_id: The ID of the Google Cloud Project. (templated) |
| :param labels: User-provided labels, in key/value pairs. |
| :param gcp_conn_id: (Optional) The connection ID used to connect to Google Cloud. |
| :param delegate_to: The account to impersonate using domain-wide delegation of authority, |
| if any. For this to work, the service account making the request must have |
| domain-wide delegation enabled. |
| :param impersonation_chain: Optional service account to impersonate using short-term |
| credentials, or chained list of accounts required to get the access_token |
| of the last account in the list, which will be impersonated in the request. |
| If set as a string, the account must grant the originating account |
| the Service Account Token Creator IAM role. |
| If set as a sequence, the identities from the list must grant |
| Service Account Token Creator IAM role to the directly preceding identity, with first |
| account from the list granting this role to the originating account (templated). |
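| |
| The grant chain described for ``impersonation_chain`` can be easier to follow when it is |
| spelled out. The sketch below is illustrative only: ``required_grants`` and the account names |
| are hypothetical and not part of the provider. It simply lists which identity would need to |
| grant the Service Account Token Creator role to which other identity. |
| |
```python
from typing import List, Sequence, Tuple, Union


def required_grants(
    originating: str, impersonation_chain: Union[str, Sequence[str]]
) -> List[Tuple[str, str]]:
    """Return (granting_account, receiving_account) pairs for a chain.

    Hypothetical helper: it only models the rule stated in the parameter text.
    """
    # A single string is treated as a chain of length one.
    if isinstance(impersonation_chain, str):
        chain = [impersonation_chain]
    else:
        chain = list(impersonation_chain)

    grants = []
    previous = originating
    for account in chain:
        # Each identity grants the role to the identity directly preceding it;
        # the first one grants it to the originating account.
        grants.append((account, previous))
        previous = account
    return grants


pairs = required_grants(
    "origin@example.iam", ["first@example.iam", "final@example.iam"]
)
for granting, receiving in pairs:
    print(f"{granting} grants Token Creator to {receiving}")
```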
| |
| The following Operator would create a new bucket ``test-bucket`` |
| with the ``MULTI_REGIONAL`` storage class in the ``EU`` region: |
| |
| .. code-block:: python |
| |
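| from airflow.providers.google.cloud.operators.gcs import GCSCreateBucketOperator |
| |
| # Create the ``test-bucket`` bucket with two labels attached. |
| CreateBucket = GCSCreateBucketOperator( |
|     task_id="CreateNewBucket", |
|     bucket_name="test-bucket", |
|     storage_class="MULTI_REGIONAL", |
|     location="EU", |
|     labels={"env": "dev", "team": "airflow"}, |
|     gcp_conn_id="airflow-conn-id", |
| ) |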
| |
| |
| .. py:attribute:: template_fields |
| :annotation: :Sequence[str] = ['bucket_name', 'storage_class', 'location', 'project_id', 'impersonation_chain'] |
| |
| |
| |
| .. py:attribute:: ui_color |
| :annotation: = #f0eee4 |
| |
| |
| |
| .. py:method:: execute(self, context) |
| |
| This is the main method to derive when creating an operator. |
| Context is the same dictionary used as when rendering jinja templates. |
| |
| Refer to get_template_context for more context. |
| |
| |
| |
| .. py:class:: GCSListObjectsOperator(*, bucket, prefix = None, delimiter = None, gcp_conn_id = 'google_cloud_default', delegate_to = None, impersonation_chain = None, **kwargs) |
| |
| Bases: :py:obj:`airflow.models.BaseOperator` |
| |
| Lists all objects in the bucket whose names match the given prefix and delimiter. |
| |
| This operator returns a Python list of object names, which downstream tasks can |
| retrieve through XCom. |
| |
| :param bucket: The Google Cloud Storage bucket in which to find the objects. (templated) |
| :param prefix: Prefix string which filters objects whose names begin with |
| this prefix. (templated) |
| :param delimiter: The delimiter by which you want to filter the objects. (templated) |
| For example, to list the CSV files in a GCS directory, use |
| ``delimiter='.csv'``. |
| :param gcp_conn_id: (Optional) The connection ID used to connect to Google Cloud. |
| :param delegate_to: The account to impersonate using domain-wide delegation of authority, |
| if any. For this to work, the service account making the request must have |
| domain-wide delegation enabled. |
| :param impersonation_chain: Optional service account to impersonate using short-term |
| credentials, or chained list of accounts required to get the access_token |
| of the last account in the list, which will be impersonated in the request. |
| If set as a string, the account must grant the originating account |
| the Service Account Token Creator IAM role. |
| If set as a sequence, the identities from the list must grant |
| Service Account Token Creator IAM role to the directly preceding identity, with first |
| account from the list granting this role to the originating account (templated). |
| |
| **Example**: |
| The following Operator would list all the Avro files from the ``sales/sales-2017`` |
| folder in the ``data`` bucket. :: |
| |
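| from airflow.providers.google.cloud.operators.gcs import GCSListObjectsOperator |
| |
| GCS_Files = GCSListObjectsOperator( |
|     task_id='GCS_Files', |
|     bucket='data', |
|     prefix='sales/sales-2017/', |
|     delimiter='.avro', |
|     gcp_conn_id=google_cloud_conn_id,  # connection ID defined elsewhere in the DAG file |
| ) |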
| |
| .. py:attribute:: template_fields |
| :annotation: :Sequence[str] = ['bucket', 'prefix', 'delimiter', 'impersonation_chain'] |
| |
| |
| |
| .. py:attribute:: ui_color |
| :annotation: = #f0eee4 |
| |
| |
| |
| .. py:method:: execute(self, context) |
| |
| This is the main method to derive when creating an operator. |
| Context is the same dictionary used as when rendering jinja templates. |
| |
| Refer to get_template_context for more context. |
| |
| |
| |
| .. py:class:: GCSDeleteObjectsOperator(*, bucket_name, objects = None, prefix = None, gcp_conn_id = 'google_cloud_default', delegate_to = None, impersonation_chain = None, **kwargs) |
| |
| Bases: :py:obj:`airflow.models.BaseOperator` |
| |
| Deletes objects from a Google Cloud Storage bucket, either |
| from an explicit list of object names or all objects |
| matching a prefix. |
| |
| :param bucket_name: The GCS bucket to delete from. |
| :param objects: List of objects to delete. These should be the names |
| of objects in the bucket, without the ``gs://bucket/`` prefix. |
| :param prefix: Prefix of objects to delete. All objects matching this |
| prefix in the bucket will be deleted. |
| :param gcp_conn_id: (Optional) The connection ID used to connect to Google Cloud. |
| :param delegate_to: The account to impersonate using domain-wide delegation of authority, |
| if any. For this to work, the service account making the request must have |
| domain-wide delegation enabled. |
| :param impersonation_chain: Optional service account to impersonate using short-term |
| credentials, or chained list of accounts required to get the access_token |
| of the last account in the list, which will be impersonated in the request. |
| If set as a string, the account must grant the originating account |
| the Service Account Token Creator IAM role. |
| If set as a sequence, the identities from the list must grant |
| Service Account Token Creator IAM role to the directly preceding identity, with first |
| account from the list granting this role to the originating account (templated). |
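| |
| Because ``objects`` expects bare object names rather than full URIs, a list of ``gs://`` URIs |
| has to be converted first. The following is a minimal sketch of such a conversion; |
| ``to_object_names`` is a hypothetical helper, not something the provider ships. |
| |
```python
from typing import List


def to_object_names(bucket_name: str, uris: List[str]) -> List[str]:
    """Strip a leading gs://<bucket_name>/ from each entry (hypothetical helper)."""
    prefix = f"gs://{bucket_name}/"
    names = []
    for uri in uris:
        if uri.startswith(prefix):
            # Keep only the part after the bucket, which is the object name.
            names.append(uri[len(prefix):])
        elif uri.startswith("gs://"):
            # A URI pointing at a different bucket is most likely a mistake.
            raise ValueError(f"{uri!r} is not in bucket {bucket_name!r}")
        else:
            # Already a bare object name.
            names.append(uri)
    return names


object_names = to_object_names(
    "data", ["gs://data/sales/2017/a.avro", "sales/2017/b.avro"]
)
```

The resulting list could then be passed as ``objects`` together with ``bucket_name="data"``.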
| |
| .. py:attribute:: template_fields |
| :annotation: :Sequence[str] = ['bucket_name', 'prefix', 'objects', 'impersonation_chain'] |
| |
| |
| |
| .. py:method:: execute(self, context) |
| |
| This is the main method to derive when creating an operator. |
| Context is the same dictionary used as when rendering jinja templates. |
| |
| Refer to get_template_context for more context. |
| |
| |
| |
| .. py:class:: GCSBucketCreateAclEntryOperator(*, bucket, entity, role, user_project = None, gcp_conn_id = 'google_cloud_default', impersonation_chain = None, **kwargs) |
| |
| Bases: :py:obj:`airflow.models.BaseOperator` |
| |
| Creates a new ACL entry on the specified bucket. |
| |
| .. seealso:: |
| For more information on how to use this operator, take a look at the guide: |
| :ref:`howto/operator:GCSBucketCreateAclEntryOperator` |
| |
| :param bucket: Name of a bucket. |
| :param entity: The entity holding the permission, in one of the following forms: |
| user-userId, user-email, group-groupId, group-email, domain-domain, |
| project-team-projectId, allUsers, allAuthenticatedUsers |
| :param role: The access permission for the entity. |
| Acceptable values are: "OWNER", "READER", "WRITER". |
| :param user_project: (Optional) The project to be billed for this request. |
| Required for Requester Pays buckets. |
| :param gcp_conn_id: (Optional) The connection ID used to connect to Google Cloud. |
| :param impersonation_chain: Optional service account to impersonate using short-term |
| credentials, or chained list of accounts required to get the access_token |
| of the last account in the list, which will be impersonated in the request. |
| If set as a string, the account must grant the originating account |
| the Service Account Token Creator IAM role. |
| If set as a sequence, the identities from the list must grant |
| Service Account Token Creator IAM role to the directly preceding identity, with first |
| account from the list granting this role to the originating account (templated). |
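| |
| The entity forms and role values listed above can be checked before a task is defined. |
| The sketch below is an assumption-laden illustration: ``check_acl_entry`` and the constant |
| names are hypothetical, and the role sets are copied from the parameter descriptions on |
| this page (bucket ACLs accept ``WRITER``; the object ACL operator lists only ``OWNER`` |
| and ``READER``). |
| |
```python
BUCKET_ROLES = {"OWNER", "READER", "WRITER"}
OBJECT_ROLES = {"OWNER", "READER"}
ENTITY_PREFIXES = ("user-", "group-", "domain-", "project-team-")
SPECIAL_ENTITIES = {"allUsers", "allAuthenticatedUsers"}


def check_acl_entry(entity: str, role: str, allowed_roles: set) -> None:
    """Raise ValueError if entity or role does not match the documented forms.

    Hypothetical helper; it performs only a shallow, local check.
    """
    # An entity is either one of the special names or "<prefix><identifier>".
    has_known_prefix = any(
        entity.startswith(p) and len(entity) > len(p) for p in ENTITY_PREFIXES
    )
    if entity not in SPECIAL_ENTITIES and not has_known_prefix:
        raise ValueError(f"unrecognized entity form: {entity!r}")
    if role not in allowed_roles:
        raise ValueError(f"role {role!r} not in {sorted(allowed_roles)}")


# Passes: WRITER is listed for bucket ACL entries.
check_acl_entry("user-liz@example.com", "WRITER", BUCKET_ROLES)
```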
| |
| .. py:attribute:: template_fields |
| :annotation: :Sequence[str] = ['bucket', 'entity', 'role', 'user_project', 'impersonation_chain'] |
| |
| |
| |
| .. py:method:: execute(self, context) |
| |
| This is the main method to derive when creating an operator. |
| Context is the same dictionary used as when rendering jinja templates. |
| |
| Refer to get_template_context for more context. |
| |
| |
| |
| .. py:class:: GCSObjectCreateAclEntryOperator(*, bucket, object_name, entity, role, generation = None, user_project = None, gcp_conn_id = 'google_cloud_default', impersonation_chain = None, **kwargs) |
| |
| Bases: :py:obj:`airflow.models.BaseOperator` |
| |
| Creates a new ACL entry on the specified object. |
| |
| .. seealso:: |
| For more information on how to use this operator, take a look at the guide: |
| :ref:`howto/operator:GCSObjectCreateAclEntryOperator` |
| |
| :param bucket: Name of a bucket. |
| :param object_name: Name of the object. For information about how to URL encode object |
| names to be path safe, see: |
| https://cloud.google.com/storage/docs/json_api/#encoding |
| :param entity: The entity holding the permission, in one of the following forms: |
| user-userId, user-email, group-groupId, group-email, domain-domain, |
| project-team-projectId, allUsers, allAuthenticatedUsers |
| :param role: The access permission for the entity. |
| Acceptable values are: "OWNER", "READER". |
| :param generation: Optional. If present, selects a specific revision of this object. |
| :param user_project: (Optional) The project to be billed for this request. |
| Required for Requester Pays buckets. |
| :param gcp_conn_id: (Optional) The connection ID used to connect to Google Cloud. |
| :param impersonation_chain: Optional service account to impersonate using short-term |
| credentials, or chained list of accounts required to get the access_token |
| of the last account in the list, which will be impersonated in the request. |
| If set as a string, the account must grant the originating account |
| the Service Account Token Creator IAM role. |
| If set as a sequence, the identities from the list must grant |
| Service Account Token Creator IAM role to the directly preceding identity, with first |
| account from the list granting this role to the originating account (templated). |
| |
| .. py:attribute:: template_fields |
| :annotation: :Sequence[str] = ['bucket', 'object_name', 'entity', 'generation', 'role', 'user_project', 'impersonation_chain'] |
| |
| |
| |
| .. py:method:: execute(self, context) |
| |
| This is the main method to derive when creating an operator. |
| Context is the same dictionary used as when rendering jinja templates. |
| |
| Refer to get_template_context for more context. |
| |
| |
| |
| .. py:class:: GCSFileTransformOperator(*, source_bucket, source_object, transform_script, destination_bucket = None, destination_object = None, gcp_conn_id = 'google_cloud_default', impersonation_chain = None, **kwargs) |
| |
| Bases: :py:obj:`airflow.models.BaseOperator` |
| |
| Copies data from a source GCS location to a temporary location on the |
| local filesystem. Runs a transformation on this file as specified by |
| the transformation script and uploads the output to a destination bucket. |
| If the output bucket is not specified the original file will be |
| overwritten. |
| |
| The locations of the source and the destination files in the local |
| filesystem are provided as the first and second arguments to the |
| transformation script. The transformation script is expected to read the |
| data from the source, transform it, and write the output to the local |
| destination file. |
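| |
| A transformation script following this contract might look like the sketch below. The |
| upper-casing step and the names ``transform`` and ``main`` are purely illustrative |
| assumptions; only the argument order (source path first, destination path second) comes |
| from the description above. |
| |
```python
import sys


def transform(source_path: str, destination_path: str) -> None:
    """Read the source file and write an upper-cased copy to the destination."""
    with open(source_path, encoding="utf-8") as src, open(
        destination_path, "w", encoding="utf-8"
    ) as dst:
        for line in src:
            dst.write(line.upper())


def main(argv: list) -> None:
    # argv[1] is the local source file, argv[2] the local destination file.
    if len(argv) != 3:
        raise SystemExit("usage: script.py SOURCE DESTINATION")
    transform(argv[1], argv[2])


if __name__ == "__main__" and len(sys.argv) > 1:
    main(sys.argv)
```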
| |
| :param source_bucket: The bucket to locate the source_object. (templated) |
| :param source_object: The key to be retrieved from GCS. (templated) |
| :param destination_bucket: The bucket to upload the key after transformation. |
| If not provided, source_bucket will be used. (templated) |
| :param destination_object: The key to be written in GCS. |
| If not provided, source_object will be used. (templated) |
| :param transform_script: Location of the executable transformation script, or a list of arguments |
| passed to subprocess, e.g. ``['python', 'script.py', 10]``. (templated) |
| :param gcp_conn_id: The connection ID to use when connecting to Google Cloud. |
| :param impersonation_chain: Optional service account to impersonate using short-term |
| credentials, or chained list of accounts required to get the access_token |
| of the last account in the list, which will be impersonated in the request. |
| If set as a string, the account must grant the originating account |
| the Service Account Token Creator IAM role. |
| If set as a sequence, the identities from the list must grant |
| Service Account Token Creator IAM role to the directly preceding identity, with first |
| account from the list granting this role to the originating account (templated). |
| |
| .. py:attribute:: template_fields |
| :annotation: :Sequence[str] = ['source_bucket', 'source_object', 'destination_bucket', 'destination_object',... |
| |
| |
| |
| .. py:method:: execute(self, context) |
| |
| This is the main method to derive when creating an operator. |
| Context is the same dictionary used as when rendering jinja templates. |
| |
| Refer to get_template_context for more context. |
| |
| |
| |
| .. py:class:: GCSTimeSpanFileTransformOperator(*, source_bucket, source_prefix, source_gcp_conn_id, destination_bucket, destination_prefix, destination_gcp_conn_id, transform_script, source_impersonation_chain = None, destination_impersonation_chain = None, chunk_size = None, download_continue_on_fail = False, download_num_attempts = 1, upload_continue_on_fail = False, upload_num_attempts = 1, **kwargs) |
| |
| Bases: :py:obj:`airflow.models.BaseOperator` |
| |
| Determines a list of objects that were added or modified at a GCS source |
| location during a specific time-span, copies them to a temporary location |
| on the local file system, runs a transform on this file as specified by |
| the transformation script and uploads the output to the destination bucket. |
| |
| .. seealso:: |
| For more information on how to use this operator, take a look at the guide: |
| :ref:`howto/operator:GCSTimeSpanFileTransformOperator` |
| |
| The locations of the source and the destination files in the local |
| filesystem are provided as the first and second arguments to the |
| transformation script. The time-span is passed to the transform script as the |
| third and fourth arguments, each a UTC ISO 8601 string. |
| |
| The transformation script is expected to read the |
| data from source, transform it and write the output to the local |
| destination file. |
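| |
| A script for this operator receives four arguments. The sketch below assumes the two |
| time-span strings are in a form that ``datetime.fromisoformat`` accepts (for example |
| ``2021-01-01T00:00:00+00:00``); the header line it writes and the function names are |
| hypothetical choices made only for illustration. |
| |
```python
import sys
from datetime import datetime


def transform(
    source_path: str, destination_path: str, timespan_start: str, timespan_end: str
) -> None:
    """Copy the source file, prefixed with a comment naming the time-span."""
    # Parse the third and fourth arguments (assumed ISO 8601 with a UTC offset).
    start = datetime.fromisoformat(timespan_start)
    end = datetime.fromisoformat(timespan_end)
    with open(source_path, encoding="utf-8") as src, open(
        destination_path, "w", encoding="utf-8"
    ) as dst:
        dst.write(f"# time-span: {start.isoformat()} to {end.isoformat()}\n")
        dst.write(src.read())


def main(argv: list) -> None:
    if len(argv) != 5:
        raise SystemExit("usage: script.py SOURCE DESTINATION START END")
    transform(argv[1], argv[2], argv[3], argv[4])


if __name__ == "__main__" and len(sys.argv) > 1:
    main(sys.argv)
```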
| |
| :param source_bucket: The bucket to fetch data from. (templated) |
| :param source_prefix: Prefix string which filters objects whose names begin with |
| this prefix. Can interpolate execution date and time components. (templated) |
| :param source_gcp_conn_id: The connection ID to use when connecting to Google Cloud |
| to download files to be processed. |
| :param source_impersonation_chain: Optional service account to impersonate using short-term |
| credentials (to download files to be processed), or chained list of accounts required to |
| get the access_token of the last account in the list, which will be impersonated in the |
| request. If set as a string, the account must grant the originating account |
| the Service Account Token Creator IAM role. |
| If set as a sequence, the identities from the list must grant |
| Service Account Token Creator IAM role to the directly preceding identity, with first |
| account from the list granting this role to the originating account (templated). |
| |
| :param destination_bucket: The bucket to write data to. (templated) |
| :param destination_prefix: Prefix string for the upload location. |
| Can interpolate execution date and time components. (templated) |
| :param destination_gcp_conn_id: The connection ID to use when connecting to Google Cloud |
| to upload processed files. |
| :param destination_impersonation_chain: Optional service account to impersonate using short-term |
| credentials (to upload processed files), or chained list of accounts required to get the access_token |
| of the last account in the list, which will be impersonated in the request. |
| If set as a string, the account must grant the originating account |
| the Service Account Token Creator IAM role. |
| If set as a sequence, the identities from the list must grant |
| Service Account Token Creator IAM role to the directly preceding identity, with first |
| account from the list granting this role to the originating account (templated). |
| |
| :param transform_script: Location of the executable transformation script, or a list of arguments |
| passed to subprocess, e.g. ``['python', 'script.py', 10]``. (templated) |
| |
| |
| :param chunk_size: The size of a chunk of data when downloading or uploading (in bytes). |
| This must be a multiple of 256 KB (per the Google Cloud Storage API specification). |
| :param download_continue_on_fail: If True, a failed download does not fail the task; |
| processing continues with the remaining files. |
| :param download_num_attempts: Number of attempts to try to download a single file. |
| :param upload_continue_on_fail: If True, a failed upload does not fail the task; |
| processing continues with the remaining files. |
| :param upload_num_attempts: Number of attempts to try to upload a single file. |
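| |
| The 256 KB constraint on ``chunk_size`` can be checked locally before the operator is |
| instantiated. This sketch assumes "256 KB" means 256 × 1024 bytes; the helper names are |
| hypothetical and not part of the provider. |
| |
```python
from typing import Optional

CHUNK_UNIT = 256 * 1024  # assumed meaning of "256 KB", in bytes


def is_valid_chunk_size(chunk_size: Optional[int]) -> bool:
    """None means 'use the default'; otherwise require a positive multiple."""
    if chunk_size is None:
        return True
    return chunk_size > 0 and chunk_size % CHUNK_UNIT == 0


def round_up_chunk_size(requested_bytes: int) -> int:
    """Round a requested size up to the next multiple of CHUNK_UNIT."""
    # Ceiling division without floats; at least one unit is always returned.
    units = max(1, -(-requested_bytes // CHUNK_UNIT))
    return units * CHUNK_UNIT


chunk_size = round_up_chunk_size(1_000_000)
```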
| |
| .. py:attribute:: template_fields |
| :annotation: :Sequence[str] = ['source_bucket', 'source_prefix', 'destination_bucket', 'destination_prefix',... |
| |
| |
| |
| .. py:method:: interpolate_prefix(prefix, dt) |
| :staticmethod: |
| |
| Interpolate prefix with datetime. |
| |
| :param prefix: The prefix to interpolate |
| :param dt: The datetime to interpolate |
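| |
| One plausible reading of this method, sketched here under the assumption that the prefix |
| carries ``strftime``-style directives such as ``%Y`` or ``%m``, is shown below. The |
| standalone function is an illustration and may differ from the provider's implementation. |
| |
```python
from datetime import datetime
from typing import Optional


def interpolate_prefix(prefix: Optional[str], dt: datetime) -> Optional[str]:
    """Fill strftime directives in ``prefix`` from ``dt`` (assumed behavior)."""
    # An empty or missing prefix is passed through as None.
    return dt.strftime(prefix) if prefix else None


example = interpolate_prefix("sales/%Y/%m/%d/", datetime(2021, 3, 4))
```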
| |
| |
| |
| .. py:method:: execute(self, context) |
| |
| This is the main method to derive when creating an operator. |
| Context is the same dictionary used as when rendering jinja templates. |
| |
| Refer to get_template_context for more context. |
| |
| |
| |
| .. py:class:: GCSDeleteBucketOperator(*, bucket_name, force = True, gcp_conn_id = 'google_cloud_default', impersonation_chain = None, **kwargs) |
| |
| Bases: :py:obj:`airflow.models.BaseOperator` |
| |
| Deletes a bucket from Google Cloud Storage. |
| |
| .. seealso:: |
| For more information on how to use this operator, take a look at the guide: |
| :ref:`howto/operator:GCSDeleteBucketOperator` |
| |
| :param bucket_name: Name of the bucket which will be deleted. |
| :param force: If False, a non-empty bucket cannot be deleted; set ``force=True`` |
| to allow deleting a non-empty bucket. |
| :param gcp_conn_id: The connection ID to use when connecting to Google Cloud. |
| :param impersonation_chain: Optional service account to impersonate using short-term |
| credentials, or chained list of accounts required to get the access_token |
| of the last account in the list, which will be impersonated in the request. |
| If set as a string, the account must grant the originating account |
| the Service Account Token Creator IAM role. |
| If set as a sequence, the identities from the list must grant |
| Service Account Token Creator IAM role to the directly preceding identity, with first |
| account from the list granting this role to the originating account (templated). |
| |
| .. py:attribute:: template_fields |
| :annotation: :Sequence[str] = ['bucket_name', 'gcp_conn_id', 'impersonation_chain'] |
| |
| |
| |
| .. py:method:: execute(self, context) |
| |
| This is the main method to derive when creating an operator. |
| Context is the same dictionary used as when rendering jinja templates. |
| |
| Refer to get_template_context for more context. |
| |
| |
| |
| .. py:class:: GCSSynchronizeBucketsOperator(*, source_bucket, destination_bucket, source_object = None, destination_object = None, recursive = True, delete_extra_files = False, allow_overwrite = False, gcp_conn_id = 'google_cloud_default', delegate_to = None, impersonation_chain = None, **kwargs) |
| |
| Bases: :py:obj:`airflow.models.BaseOperator` |
| |
| Synchronizes the contents of buckets, or of directories within buckets, in Google Cloud Storage. |
| |
| Parameters ``source_object`` and ``destination_object`` describe the root sync directory. If they are |
| not passed, the entire bucket will be synchronized. They should point to directories. |
| |
| .. note:: |
| The synchronization of individual files is not supported. Only entire directories can be |
| synchronized. |
| |
| .. seealso:: |
| For more information on how to use this operator, take a look at the guide: |
| :ref:`howto/operator:GCSSynchronizeBuckets` |
| |
| :param source_bucket: The name of the bucket containing the source objects. |
| :param destination_bucket: The name of the bucket containing the destination objects. |
| :param source_object: The root sync directory in the source bucket. |
| :param destination_object: The root sync directory in the destination bucket. |
| :param recursive: If True, subdirectories will be considered. |
| :param allow_overwrite: If True, the files will be overwritten if a mismatched file is found. |
| By default, overwriting files is not allowed. |
| :param delete_extra_files: If True, deletes files from the destination that are not found in the |
| source. By default, extra files are not deleted. |
| |
| .. note:: |
| This option can delete data quickly if you specify the wrong source/destination combination. |
| |
| :param gcp_conn_id: (Optional) The connection ID used to connect to Google Cloud. |
| :param delegate_to: The account to impersonate using domain-wide delegation of authority, |
| if any. For this to work, the service account making the request must have |
| domain-wide delegation enabled. |
| :param impersonation_chain: Optional service account to impersonate using short-term |
| credentials, or chained list of accounts required to get the access_token |
| of the last account in the list, which will be impersonated in the request. |
| If set as a string, the account must grant the originating account |
| the Service Account Token Creator IAM role. |
| If set as a sequence, the identities from the list must grant |
| Service Account Token Creator IAM role to the directly preceding identity, with first |
| account from the list granting this role to the originating account (templated). |
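| |
| How ``allow_overwrite`` and ``delete_extra_files`` interact can be modelled with plain sets. |
| The sketch below is a simplified local model, not the provider's code: ``plan_sync`` is |
| hypothetical, objects are represented as name-to-checksum dicts, and it assumes |
| rsync-like semantics in which "extra" files are those present only in the destination. |
| |
```python
from typing import Dict, List, Tuple


def plan_sync(
    source: Dict[str, str],
    destination: Dict[str, str],
    allow_overwrite: bool = False,
    delete_extra_files: bool = False,
) -> Tuple[List[str], List[str], List[str]]:
    """Return (to_copy, to_overwrite, to_delete) object names."""
    # Objects missing from the destination are always copied.
    to_copy = sorted(set(source) - set(destination))
    # Objects present on both sides with different checksums are mismatched.
    mismatched = sorted(
        name for name in source.keys() & destination.keys()
        if source[name] != destination[name]
    )
    to_overwrite = mismatched if allow_overwrite else []
    # Destination-only objects are removed only when explicitly requested.
    extra = sorted(set(destination) - set(source))
    to_delete = extra if delete_extra_files else []
    return to_copy, to_overwrite, to_delete


source = {"a.txt": "c1", "b.txt": "c2"}
destination = {"b.txt": "c9", "c.txt": "c3"}
default_plan = plan_sync(source, destination)
full_plan = plan_sync(source, destination, allow_overwrite=True, delete_extra_files=True)
```

With the defaults, only ``a.txt`` is scheduled; enabling both flags additionally schedules
``b.txt`` for overwrite and ``c.txt`` for deletion, which is why swapping source and
destination by mistake can remove data.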
| |
| .. py:attribute:: template_fields |
| :annotation: :Sequence[str] = ['source_bucket', 'destination_bucket', 'source_object', 'destination_object', 'recursive',... |
| |
| |
| |
| .. py:method:: execute(self, context) |
| |
| This is the main method to derive when creating an operator. |
| Context is the same dictionary used as when rendering jinja templates. |
| |
| Refer to get_template_context for more context. |
| |
| |
| |