| :py:mod:`airflow.providers.google.cloud.hooks.dataproc` |
| ======================================================= |
| |
| .. py:module:: airflow.providers.google.cloud.hooks.dataproc |
| |
| .. autoapi-nested-parse:: |
| |
| This module contains a Google Cloud Dataproc hook. |
| |
| |
| |
| Module Contents |
| --------------- |
| |
| Classes |
| ~~~~~~~ |
| |
| .. autoapisummary:: |
| |
| airflow.providers.google.cloud.hooks.dataproc.DataProcJobBuilder |
| airflow.providers.google.cloud.hooks.dataproc.DataprocHook |
| |
| |
| |
| |
| .. py:class:: DataProcJobBuilder(project_id, task_id, cluster_name, job_type, properties = None) |
| |
| A helper class for building Dataproc job. |
| |
| .. py:method:: add_labels(self, labels = None) |
| |
| Set labels for Dataproc job. |
| |
| :param labels: Labels for the job query. |
| |
| |
| .. py:method:: add_variables(self, variables = None) |
| |
| Set variables for Dataproc job. |
| |
| :param variables: Variables for the job query. |
| |
| |
| .. py:method:: add_args(self, args = None) |
| |
| Set args for Dataproc job. |
| |
| :param args: Args for the job query. |
| |
| |
| .. py:method:: add_query(self, query) |
| |
| Set query for Dataproc job. |
| |
| :param query: query for the job. |
| |
| |
| .. py:method:: add_query_uri(self, query_uri) |
| |
| Set query uri for Dataproc job. |
| |
| :param query_uri: URI for the job query. |
| |
| |
| .. py:method:: add_jar_file_uris(self, jars = None) |
| |
| Set jars uris for Dataproc job. |
| |
| :param jars: List of jars URIs |
| |
| |
| .. py:method:: add_archive_uris(self, archives = None) |
| |
| Set archives uris for Dataproc job. |
| |
| :param archives: List of archives URIs |
| |
| |
| .. py:method:: add_file_uris(self, files = None) |
| |
| Set file uris for Dataproc job. |
| |
| :param files: List of files URIs |
| |
| |
| .. py:method:: add_python_file_uris(self, pyfiles = None) |
| |
| Set python file uris for Dataproc job. |
| |
| :param pyfiles: List of python files URIs |
| |
| |
| .. py:method:: set_main(self, main_jar = None, main_class = None) |
| |
| Set Dataproc main class. |
| |
| :param main_jar: URI for the main file. |
| :param main_class: Name of the main class. |
| :raises: Exception |
| |
| |
| .. py:method:: set_python_main(self, main) |
| |
| Set Dataproc main python file uri. |
| |
| :param main: URI for the python main file. |
| |
| |
| .. py:method:: set_job_name(self, name) |
| |
| Set Dataproc job name. Job name is sanitized, replacing dots by underscores. |
| |
| :param name: Job name. |
| |
| |
| .. py:method:: build(self) |
| |
| Returns Dataproc job. |
| |
| :return: Dataproc job |
| :rtype: dict |
| |
| |
| |
| .. py:class:: DataprocHook(gcp_conn_id = 'google_cloud_default', delegate_to = None, impersonation_chain = None) |
| |
| Bases: :py:obj:`airflow.providers.google.common.hooks.base_google.GoogleBaseHook` |
| |
| Hook for Google Cloud Dataproc APIs. |
| |
| All the methods in the hook where project_id is used must be called with |
| keyword arguments rather than positional. |
| |
| .. py:method:: get_cluster_client(self, region = None) |
| |
| Returns ClusterControllerClient. |
| |
| |
| .. py:method:: get_template_client(self, region = None) |
| |
| Returns WorkflowTemplateServiceClient. |
| |
| |
| .. py:method:: get_job_client(self, region = None) |
| |
| Returns JobControllerClient. |
| |
| |
| .. py:method:: get_batch_client(self, region = None) |
| |
| Returns BatchControllerClient |
| |
| |
| .. py:method:: wait_for_operation(self, operation, timeout = None) |
| |
| Waits for long-lasting operation to complete. |
| |
| |
| .. py:method:: create_cluster(self, region, project_id, cluster_name, cluster_config = None, virtual_cluster_config = None, labels = None, request_id = None, retry = DEFAULT, timeout = None, metadata = ()) |
| |
| Creates a cluster in a project. |
| |
| :param project_id: Required. The ID of the Google Cloud project that the cluster belongs to. |
| :param region: Required. The Cloud Dataproc region in which to handle the request. |
| :param cluster_name: Name of the cluster to create |
| :param labels: Labels that will be assigned to created cluster |
| :param cluster_config: Required. The cluster config to create. |
| If a dict is provided, it must be of the same form as the protobuf message |
| :class:`~google.cloud.dataproc_v1.types.ClusterConfig` |
| :param virtual_cluster_config: Optional. The virtual cluster config, used when creating a Dataproc |
| cluster that does not directly control the underlying compute resources, for example, when |
| creating a `Dataproc-on-GKE cluster` |
| :class:`~google.cloud.dataproc_v1.types.VirtualClusterConfig` |
| :param request_id: Optional. A unique id used to identify the request. If the server receives two |
| ``CreateClusterRequest`` requests with the same id, then the second request will be ignored and |
| the first ``google.longrunning.Operation`` created and stored in the backend is returned. |
| :param retry: A retry object used to retry requests. If ``None`` is specified, requests will not be |
| retried. |
| :param timeout: The amount of time, in seconds, to wait for the request to complete. Note that if |
| ``retry`` is specified, the timeout applies to each individual attempt. |
| :param metadata: Additional metadata that is provided to the method. |
| |
| |
| .. py:method:: delete_cluster(self, region, cluster_name, project_id, cluster_uuid = None, request_id = None, retry = DEFAULT, timeout = None, metadata = ()) |
| |
| Deletes a cluster in a project. |
| |
| :param project_id: Required. The ID of the Google Cloud project that the cluster belongs to. |
| :param region: Required. The Cloud Dataproc region in which to handle the request. |
| :param cluster_name: Required. The cluster name. |
| :param cluster_uuid: Optional. Specifying the ``cluster_uuid`` means the RPC should fail |
| if cluster with specified UUID does not exist. |
| :param request_id: Optional. A unique id used to identify the request. If the server receives two |
| ``DeleteClusterRequest`` requests with the same id, then the second request will be ignored and |
| the first ``google.longrunning.Operation`` created and stored in the backend is returned. |
| :param retry: A retry object used to retry requests. If ``None`` is specified, requests will not be |
| retried. |
| :param timeout: The amount of time, in seconds, to wait for the request to complete. Note that if |
| ``retry`` is specified, the timeout applies to each individual attempt. |
| :param metadata: Additional metadata that is provided to the method. |
| |
| |
| .. py:method:: diagnose_cluster(self, region, cluster_name, project_id, retry = DEFAULT, timeout = None, metadata = ()) |
| |
| Gets cluster diagnostic information. After the operation completes GCS uri to |
| diagnose is returned |
| |
| :param project_id: Required. The ID of the Google Cloud project that the cluster belongs to. |
| :param region: Required. The Cloud Dataproc region in which to handle the request. |
| :param cluster_name: Required. The cluster name. |
| :param retry: A retry object used to retry requests. If ``None`` is specified, requests will not be |
| retried. |
| :param timeout: The amount of time, in seconds, to wait for the request to complete. Note that if |
| ``retry`` is specified, the timeout applies to each individual attempt. |
| :param metadata: Additional metadata that is provided to the method. |
| |
| |
| .. py:method:: get_cluster(self, region, cluster_name, project_id, retry = DEFAULT, timeout = None, metadata = ()) |
| |
| Gets the resource representation for a cluster in a project. |
| |
| :param project_id: Required. The ID of the Google Cloud project that the cluster belongs to. |
| :param region: Required. The Cloud Dataproc region in which to handle the request. |
| :param cluster_name: Required. The cluster name. |
| :param retry: A retry object used to retry requests. If ``None`` is specified, requests will not be |
| retried. |
| :param timeout: The amount of time, in seconds, to wait for the request to complete. Note that if |
| ``retry`` is specified, the timeout applies to each individual attempt. |
| :param metadata: Additional metadata that is provided to the method. |
| |
| |
| .. py:method:: list_clusters(self, region, filter_, project_id, page_size = None, retry = DEFAULT, timeout = None, metadata = ()) |
| |
| Lists all regions/{region}/clusters in a project. |
| |
| :param project_id: Required. The ID of the Google Cloud project that the cluster belongs to. |
| :param region: Required. The Cloud Dataproc region in which to handle the request. |
| :param filter_: Optional. A filter constraining the clusters to list. Filters are case-sensitive. |
| :param page_size: The maximum number of resources contained in the underlying API response. If page |
| streaming is performed per- resource, this parameter does not affect the return value. If page |
| streaming is performed per-page, this determines the maximum number of resources in a page. |
| :param retry: A retry object used to retry requests. If ``None`` is specified, requests will not be |
| retried. |
| :param timeout: The amount of time, in seconds, to wait for the request to complete. Note that if |
| ``retry`` is specified, the timeout applies to each individual attempt. |
| :param metadata: Additional metadata that is provided to the method. |
| |
| |
| .. py:method:: update_cluster(self, cluster_name, cluster, update_mask, project_id, region, graceful_decommission_timeout = None, request_id = None, retry = DEFAULT, timeout = None, metadata = ()) |
| |
| Updates a cluster in a project. |
| |
| :param project_id: Required. The ID of the Google Cloud project the cluster belongs to. |
| :param region: Required. The Cloud Dataproc region in which to handle the request. |
| :param cluster_name: Required. The cluster name. |
| :param cluster: Required. The changes to the cluster. |
| |
| If a dict is provided, it must be of the same form as the protobuf message |
| :class:`~google.cloud.dataproc_v1.types.Cluster` |
| :param update_mask: Required. Specifies the path, relative to ``Cluster``, of the field to update. For |
| example, to change the number of workers in a cluster to 5, the ``update_mask`` parameter would be |
| specified as ``config.worker_config.num_instances``, and the ``PATCH`` request body would specify |
| the new value, as follows: |
| |
| :: |
| |
| { "config":{ "workerConfig":{ "numInstances":"5" } } } |
| |
| Similarly, to change the number of preemptible workers in a cluster to 5, the ``update_mask`` |
| parameter would be ``config.secondary_worker_config.num_instances``, and the ``PATCH`` request |
| body would be set as follows: |
| |
| :: |
| |
| { "config":{ "secondaryWorkerConfig":{ "numInstances":"5" } } } |
| |
| If a dict is provided, it must be of the same form as the protobuf message |
| :class:`~google.cloud.dataproc_v1.types.FieldMask` |
| :param graceful_decommission_timeout: Optional. Timeout for graceful YARN decommissioning. Graceful |
| decommissioning allows removing nodes from the cluster without interrupting jobs in progress. |
| Timeout specifies how long to wait for jobs in progress to finish before forcefully removing nodes |
| (and potentially interrupting jobs). Default timeout is 0 (for forceful decommission), and the |
| maximum allowed timeout is 1 day. |
| |
| Only supported on Dataproc image versions 1.2 and higher. |
| |
| If a dict is provided, it must be of the same form as the protobuf message |
| :class:`~google.cloud.dataproc_v1.types.Duration` |
| :param request_id: Optional. A unique id used to identify the request. If the server receives two |
| ``UpdateClusterRequest`` requests with the same id, then the second request will be ignored and |
| the first ``google.longrunning.Operation`` created and stored in the backend is returned. |
| :param retry: A retry object used to retry requests. If ``None`` is specified, requests will not be |
| retried. |
| :param timeout: The amount of time, in seconds, to wait for the request to complete. Note that if |
| ``retry`` is specified, the timeout applies to each individual attempt. |
| :param metadata: Additional metadata that is provided to the method. |
| |
| |
| .. py:method:: create_workflow_template(self, template, project_id, region, retry = DEFAULT, timeout = None, metadata = ()) |
| |
| Creates new workflow template. |
| |
| :param project_id: Required. The ID of the Google Cloud project the cluster belongs to. |
| :param region: Required. The Cloud Dataproc region in which to handle the request. |
| :param template: The Dataproc workflow template to create. If a dict is provided, |
| it must be of the same form as the protobuf message WorkflowTemplate. |
| :param retry: A retry object used to retry requests. If ``None`` is specified, requests will not be |
| retried. |
| :param timeout: The amount of time, in seconds, to wait for the request to complete. Note that if |
| ``retry`` is specified, the timeout applies to each individual attempt. |
| :param metadata: Additional metadata that is provided to the method. |
| |
| |
| .. py:method:: instantiate_workflow_template(self, template_name, project_id, region, version = None, request_id = None, parameters = None, retry = DEFAULT, timeout = None, metadata = ()) |
| |
| Instantiates a template and begins execution. |
| |
| :param template_name: Name of template to instantiate. |
| :param project_id: Required. The ID of the Google Cloud project the cluster belongs to. |
| :param region: Required. The Cloud Dataproc region in which to handle the request. |
| :param version: Optional. The version of workflow template to instantiate. If specified, |
| the workflow will be instantiated only if the current version of |
| the workflow template has the supplied version. |
| This option cannot be used to instantiate a previous version of |
| workflow template. |
| :param request_id: Optional. A tag that prevents multiple concurrent workflow instances |
| with the same tag from running. This mitigates risk of concurrent |
| instances started due to retries. |
| :param parameters: Optional. Map from parameter names to values that should be used for those |
| parameters. Values may not exceed 100 characters. |
| :param retry: A retry object used to retry requests. If ``None`` is specified, requests will not be |
| retried. |
| :param timeout: The amount of time, in seconds, to wait for the request to complete. Note that if |
| ``retry`` is specified, the timeout applies to each individual attempt. |
| :param metadata: Additional metadata that is provided to the method. |
| |
| |
| .. py:method:: instantiate_inline_workflow_template(self, template, project_id, region, request_id = None, retry = DEFAULT, timeout = None, metadata = ()) |
| |
| Instantiates a template and begins execution. |
| |
| :param template: The workflow template to instantiate. If a dict is provided, |
| it must be of the same form as the protobuf message WorkflowTemplate |
| :param project_id: Required. The ID of the Google Cloud project the cluster belongs to. |
| :param region: Required. The Cloud Dataproc region in which to handle the request. |
| :param request_id: Optional. A tag that prevents multiple concurrent workflow instances |
| with the same tag from running. This mitigates risk of concurrent |
| instances started due to retries. |
| :param retry: A retry object used to retry requests. If ``None`` is specified, requests will not be |
| retried. |
| :param timeout: The amount of time, in seconds, to wait for the request to complete. Note that if |
| ``retry`` is specified, the timeout applies to each individual attempt. |
| :param metadata: Additional metadata that is provided to the method. |
| |
| |
| .. py:method:: wait_for_job(self, job_id, project_id, region, wait_time = 10, timeout = None) |
| |
| Helper method which polls a job to check if it finishes. |
| |
| :param job_id: Id of the Dataproc job |
| :param project_id: Required. The ID of the Google Cloud project the cluster belongs to. |
| :param region: Required. The Cloud Dataproc region in which to handle the request. |
| :param wait_time: Number of seconds between checks |
| :param timeout: How many seconds wait for job to be ready. Used only if ``asynchronous`` is False |
| |
| |
| .. py:method:: get_job(self, job_id, project_id, region, retry = DEFAULT, timeout = None, metadata = ()) |
| |
| Gets the resource representation for a job in a project. |
| |
| :param job_id: Id of the Dataproc job |
| :param project_id: Required. The ID of the Google Cloud project the cluster belongs to. |
| :param region: Required. The Cloud Dataproc region in which to handle the request. |
| :param retry: A retry object used to retry requests. If ``None`` is specified, requests will not be |
| retried. |
| :param timeout: The amount of time, in seconds, to wait for the request to complete. Note that if |
| ``retry`` is specified, the timeout applies to each individual attempt. |
| :param metadata: Additional metadata that is provided to the method. |
| |
| |
| .. py:method:: submit_job(self, job, project_id, region, request_id = None, retry = DEFAULT, timeout = None, metadata = ()) |
| |
| Submits a job to a cluster. |
| |
| :param job: The job resource. If a dict is provided, |
| it must be of the same form as the protobuf message Job |
| :param project_id: Required. The ID of the Google Cloud project the cluster belongs to. |
| :param region: Required. The Cloud Dataproc region in which to handle the request. |
| :param request_id: Optional. A tag that prevents multiple concurrent workflow instances |
| with the same tag from running. This mitigates risk of concurrent |
| instances started due to retries. |
| :param retry: A retry object used to retry requests. If ``None`` is specified, requests will not be |
| retried. |
| :param timeout: The amount of time, in seconds, to wait for the request to complete. Note that if |
| ``retry`` is specified, the timeout applies to each individual attempt. |
| :param metadata: Additional metadata that is provided to the method. |
| |
| |
| .. py:method:: cancel_job(self, job_id, project_id, region = None, retry = DEFAULT, timeout = None, metadata = ()) |
| |
| Starts a job cancellation request. |
| |
| :param project_id: Required. The ID of the Google Cloud project that the job belongs to. |
| :param region: Required. The Cloud Dataproc region in which to handle the request. |
| :param job_id: Required. The job ID. |
| :param retry: A retry object used to retry requests. If ``None`` is specified, requests will not be |
| retried. |
| :param timeout: The amount of time, in seconds, to wait for the request to complete. Note that if |
| ``retry`` is specified, the timeout applies to each individual attempt. |
| :param metadata: Additional metadata that is provided to the method. |
| |
| |
| .. py:method:: create_batch(self, region, project_id, batch, batch_id = None, request_id = None, retry = DEFAULT, timeout = None, metadata = ()) |
| |
| Creates a batch workload. |
| |
| :param project_id: Required. The ID of the Google Cloud project that the cluster belongs to. |
| :param region: Required. The Cloud Dataproc region in which to handle the request. |
| :param batch: Required. The batch to create. |
| :param batch_id: Optional. The ID to use for the batch, which will become the final component |
| of the batch's resource name. |
| This value must be 4-63 characters. Valid characters are /[a-z][0-9]-/. |
| :param request_id: Optional. A unique id used to identify the request. If the server receives two |
| ``CreateBatchRequest`` requests with the same id, then the second request will be ignored and |
| the first ``google.longrunning.Operation`` created and stored in the backend is returned. |
| :param retry: A retry object used to retry requests. If ``None`` is specified, requests will not be |
| retried. |
| :param timeout: The amount of time, in seconds, to wait for the request to complete. Note that if |
| ``retry`` is specified, the timeout applies to each individual attempt. |
| :param metadata: Additional metadata that is provided to the method. |
| |
| |
| .. py:method:: delete_batch(self, batch_id, region, project_id, retry = DEFAULT, timeout = None, metadata = ()) |
| |
| Deletes the batch workload resource. |
| |
| :param batch_id: Required. The ID to use for the batch, which will become the final component |
| of the batch's resource name. |
| This value must be 4-63 characters. Valid characters are /[a-z][0-9]-/. |
| :param project_id: Required. The ID of the Google Cloud project that the cluster belongs to. |
| :param region: Required. The Cloud Dataproc region in which to handle the request. |
| :param retry: A retry object used to retry requests. If ``None`` is specified, requests will not be |
| retried. |
| :param timeout: The amount of time, in seconds, to wait for the request to complete. Note that if |
| ``retry`` is specified, the timeout applies to each individual attempt. |
| :param metadata: Additional metadata that is provided to the method. |
| |
| |
| .. py:method:: get_batch(self, batch_id, region, project_id, retry = DEFAULT, timeout = None, metadata = ()) |
| |
| Gets the batch workload resource representation. |
| |
| :param batch_id: Required. The ID to use for the batch, which will become the final component |
| of the batch's resource name. |
| This value must be 4-63 characters. Valid characters are /[a-z][0-9]-/. |
| :param project_id: Required. The ID of the Google Cloud project that the cluster belongs to. |
| :param region: Required. The Cloud Dataproc region in which to handle the request. |
| :param retry: A retry object used to retry requests. If ``None`` is specified, requests will not be |
| retried. |
| :param timeout: The amount of time, in seconds, to wait for the request to complete. Note that if |
| ``retry`` is specified, the timeout applies to each individual attempt. |
| :param metadata: Additional metadata that is provided to the method. |
| |
| |
| .. py:method:: list_batches(self, region, project_id, page_size = None, page_token = None, retry = DEFAULT, timeout = None, metadata = ()) |
| |
| Lists batch workloads. |
| |
| :param project_id: Required. The ID of the Google Cloud project that the cluster belongs to. |
| :param region: Required. The Cloud Dataproc region in which to handle the request. |
| :param page_size: Optional. The maximum number of batches to return in each response. The service may |
| return fewer than this value. The default page size is 20; the maximum page size is 1000. |
| :param page_token: Optional. A page token received from a previous ``ListBatches`` call. |
| Provide this token to retrieve the subsequent page. |
| :param retry: A retry object used to retry requests. If ``None`` is specified, requests will not be |
| retried. |
| :param timeout: The amount of time, in seconds, to wait for the request to complete. Note that if |
| ``retry`` is specified, the timeout applies to each individual attempt. |
| :param metadata: Additional metadata that is provided to the method. |
| |
| |
| |