| :mod:`airflow.contrib.operators.dataproc_operator` |
| ================================================== |
| |
| .. py:module:: airflow.contrib.operators.dataproc_operator |
| |
| .. autoapi-nested-parse:: |
| |
| This module contains Google Dataproc operators. |
| |
| |
| |
| Module Contents |
| --------------- |
| |
| .. py:class:: DataprocOperationBaseOperator(project_id, region='global', gcp_conn_id='google_cloud_default', delegate_to=None, *args, **kwargs) |
| |
| Bases: :class:`airflow.models.BaseOperator` |
| |
| The base class for operators that poll on a Dataproc Operation. |
| |
| |
| .. method:: execute(self, context) |
| |
| |
| |
| |
| .. method:: start(self, context) |
| |
| |
| |
| |
| .. py:class:: DataprocClusterCreateOperator(project_id, cluster_name, num_workers, zone=None, network_uri=None, subnetwork_uri=None, internal_ip_only=None, tags=None, storage_bucket=None, init_actions_uris=None, init_action_timeout='10m', metadata=None, custom_image=None, custom_image_project_id=None, image_version=None, autoscaling_policy=None, properties=None, optional_components=None, num_masters=1, master_machine_type='n1-standard-4', master_disk_type='pd-standard', master_disk_size=500, worker_machine_type='n1-standard-4', worker_disk_type='pd-standard', worker_disk_size=500, num_preemptible_workers=0, labels=None, region='global', service_account=None, service_account_scopes=None, idle_delete_ttl=None, auto_delete_time=None, auto_delete_ttl=None, customer_managed_key=None, *args, **kwargs) |
| |
| Bases: :class:`airflow.contrib.operators.dataproc_operator.DataprocOperationBaseOperator` |
| |
| Create a new cluster on Google Cloud Dataproc. The operator will wait until the |
| creation is successful or an error occurs in the creation process. |
| |
| The parameters allow you to configure the cluster. Please refer to |
| |
| https://cloud.google.com/dataproc/docs/reference/rest/v1/projects.regions.clusters |
| |
| for a detailed explanation of the different parameters. Most of the configuration |
| parameters detailed in the link are available as parameters to this operator. |
| |
| :param cluster_name: The name of the DataProc cluster to create. (templated) |
| :type cluster_name: str |
| :param project_id: The ID of the Google Cloud project in which |
| to create the cluster. (templated) |
| :type project_id: str |
| :param num_workers: The number of workers to spin up. If set to zero, the |
| cluster will be spun up in single-node mode. |
| :type num_workers: int |
| :param storage_bucket: The storage bucket to use; setting this to None lets |
| Dataproc generate a custom one for you |
| :type storage_bucket: str |
| :param init_actions_uris: List of GCS URIs containing |
| dataproc initialization scripts |
| :type init_actions_uris: list[str] |
| :param init_action_timeout: Amount of time the executable scripts in |
| init_actions_uris have to complete |
| :type init_action_timeout: str |
| :param metadata: dict of key-value google compute engine metadata entries |
| to add to all instances |
| :type metadata: dict |
| :param image_version: the version of software inside the Dataproc cluster |
| :type image_version: str |
| :param custom_image: custom Dataproc image; for more info see |
| https://cloud.google.com/dataproc/docs/guides/dataproc-images |
| :type custom_image: str |
| :param custom_image_project_id: project id for the custom Dataproc image, for more info see |
| https://cloud.google.com/dataproc/docs/guides/dataproc-images |
| :type custom_image_project_id: str |
| :param autoscaling_policy: The autoscaling policy used by the cluster. Only resource names |
| including projectid and location (region) are valid. Example: |
| ``projects/[projectId]/locations/[dataproc_region]/autoscalingPolicies/[policy_id]`` |
| :type autoscaling_policy: str |
| :param properties: dict of properties to set on |
| config files (e.g. spark-defaults.conf), see |
| https://cloud.google.com/dataproc/docs/reference/rest/v1/projects.regions.clusters#SoftwareConfig |
| :type properties: dict |
| :param optional_components: List of optional cluster components, for more info see |
| https://cloud.google.com/dataproc/docs/reference/rest/v1/ClusterConfig#Component |
| :type optional_components: list[str] |
| :param num_masters: The # of master nodes to spin up |
| :type num_masters: int |
| :param master_machine_type: Compute engine machine type to use for the master node |
| :type master_machine_type: str |
| :param master_disk_type: Type of the boot disk for the master node |
| (default is ``pd-standard``). |
| Valid values: ``pd-ssd`` (Persistent Disk Solid State Drive) or |
| ``pd-standard`` (Persistent Disk Hard Disk Drive). |
| :type master_disk_type: str |
| :param master_disk_size: Disk size for the master node |
| :type master_disk_size: int |
| :param worker_machine_type: Compute engine machine type to use for the worker nodes |
| :type worker_machine_type: str |
| :param worker_disk_type: Type of the boot disk for the worker node |
| (default is ``pd-standard``). |
| Valid values: ``pd-ssd`` (Persistent Disk Solid State Drive) or |
| ``pd-standard`` (Persistent Disk Hard Disk Drive). |
| :type worker_disk_type: str |
| :param worker_disk_size: Disk size for the worker nodes |
| :type worker_disk_size: int |
| :param num_preemptible_workers: The # of preemptible worker nodes to spin up |
| :type num_preemptible_workers: int |
| :param labels: dict of labels to add to the cluster |
| :type labels: dict |
| :param zone: The zone where the cluster will be located. Set to None to auto-zone. (templated) |
| :type zone: str |
| :param network_uri: The network uri to be used for machine communication, cannot be |
| specified with subnetwork_uri |
| :type network_uri: str |
| :param subnetwork_uri: The subnetwork uri to be used for machine communication, |
| cannot be specified with network_uri |
| :type subnetwork_uri: str |
| :param internal_ip_only: If true, all instances in the cluster will only |
| have internal IP addresses. This can only be enabled on |
| subnetwork-enabled networks |
| :type internal_ip_only: bool |
| :param tags: The GCE tags to add to all instances |
| :type tags: list[str] |
| :param region: leave as 'global'; this might become relevant in the future. (templated) |
| :type region: str |
| :param gcp_conn_id: The connection ID to use connecting to Google Cloud Platform. |
| :type gcp_conn_id: str |
| :param delegate_to: The account to impersonate, if any. |
| For this to work, the service account making the request must have domain-wide |
| delegation enabled. |
| :type delegate_to: str |
| :param service_account: The service account of the Dataproc instances. |
| :type service_account: str |
| :param service_account_scopes: The URIs of service account scopes to be included. |
| :type service_account_scopes: list[str] |
| :param idle_delete_ttl: The longest duration that the cluster will stay alive |
| while idle. Exceeding this threshold causes the cluster to be auto-deleted. |
| A duration in seconds. |
| :type idle_delete_ttl: int |
| :param auto_delete_time: The time when the cluster will be auto-deleted. |
| :type auto_delete_time: datetime.datetime |
| :param auto_delete_ttl: The lifetime duration of the cluster; the cluster will be |
| auto-deleted at the end of this duration. |
| A duration in seconds. (If auto_delete_time is set, this parameter will be ignored.) |
| :type auto_delete_ttl: int |
| :param customer_managed_key: The customer-managed key used for disk encryption |
| ``projects/[PROJECT_STORING_KEYS]/locations/[LOCATION]/keyRings/[KEY_RING_NAME]/cryptoKeys/[KEY_NAME]`` |
| :type customer_managed_key: str |
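| |
| **Example** (a minimal sketch; the project, zone, and cluster values are placeholders): :: |
| |
|     create_cluster = DataprocClusterCreateOperator( |
|         task_id='create_dataproc_cluster', |
|         project_id='my-project', |
|         cluster_name='cluster-1', |
|         num_workers=2, |
|         zone='us-central1-a', |
|         master_machine_type='n1-standard-4', |
|         worker_machine_type='n1-standard-4', |
|         idle_delete_ttl=3600, |
|         dag=dag) |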
| |
| .. attribute:: template_fields |
| :annotation: = ['cluster_name', 'project_id', 'zone', 'region'] |
| |
| |
| |
| |
| .. method:: _get_init_action_timeout(self) |
| |
| |
| |
| |
| .. method:: _build_gce_cluster_config(self, cluster_data) |
| |
| |
| |
| |
| .. method:: _build_lifecycle_config(self, cluster_data) |
| |
| |
| |
| |
| .. method:: _build_cluster_data(self) |
| |
| |
| |
| |
| .. method:: start(self) |
| |
| Create a new cluster on Google Cloud Dataproc. |
| |
| |
| |
| |
| .. py:class:: DataprocClusterScaleOperator(cluster_name, project_id, region='global', num_workers=2, num_preemptible_workers=0, graceful_decommission_timeout=None, *args, **kwargs) |
| |
| Bases: :class:`airflow.contrib.operators.dataproc_operator.DataprocOperationBaseOperator` |
| |
| Scale, up or down, a cluster on Google Cloud Dataproc. |
| The operator will wait until the cluster is re-scaled. |
| |
| **Example**: :: |
| |
|     t1 = DataprocClusterScaleOperator( |
|         task_id='dataproc_scale', |
|         project_id='my-project', |
|         cluster_name='cluster-1', |
|         num_workers=10, |
|         num_preemptible_workers=10, |
|         graceful_decommission_timeout='1h', |
|         dag=dag) |
| |
| .. seealso:: |
| For more detail about scaling clusters, have a look at the reference: |
| https://cloud.google.com/dataproc/docs/concepts/configuring-clusters/scaling-clusters |
| |
| :param cluster_name: The name of the cluster to scale. (templated) |
| :type cluster_name: str |
| :param project_id: The ID of the Google Cloud project in which |
| the cluster runs. (templated) |
| :type project_id: str |
| :param region: The region for the Dataproc cluster. (templated) |
| :type region: str |
| :param gcp_conn_id: The connection ID to use connecting to Google Cloud Platform. |
| :type gcp_conn_id: str |
| :param num_workers: The new number of workers |
| :type num_workers: int |
| :param num_preemptible_workers: The new number of preemptible workers |
| :type num_preemptible_workers: int |
| :param graceful_decommission_timeout: Timeout for graceful YARN decommissioning. |
| Maximum value is 1d |
| :type graceful_decommission_timeout: str |
| :param delegate_to: The account to impersonate, if any. |
| For this to work, the service account making the request must have domain-wide |
| delegation enabled. |
| :type delegate_to: str |
| |
| .. attribute:: template_fields |
| :annotation: = ['cluster_name', 'project_id', 'region'] |
| |
| |
| |
| |
| .. method:: _build_scale_cluster_data(self) |
| |
| |
| |
| |
| .. staticmethod:: _get_graceful_decommission_timeout(timeout) |
| |
| |
| |
| |
| .. method:: start(self) |
| |
| Scale, up or down, a cluster on Google Cloud Dataproc. |
| |
| |
| |
| |
| .. py:class:: DataprocClusterDeleteOperator(cluster_name, project_id, region='global', *args, **kwargs) |
| |
| Bases: :class:`airflow.contrib.operators.dataproc_operator.DataprocOperationBaseOperator` |
| |
| Delete a cluster on Google Cloud Dataproc. The operator will wait until the |
| cluster is destroyed. |
| |
| :param cluster_name: The name of the cluster to delete. (templated) |
| :type cluster_name: str |
| :param project_id: The ID of the Google Cloud project in which |
| the cluster runs. (templated) |
| :type project_id: str |
| :param region: leave as 'global'; this might become relevant in the future. (templated) |
| :type region: str |
| :param gcp_conn_id: The connection ID to use connecting to Google Cloud Platform. |
| :type gcp_conn_id: str |
| :param delegate_to: The account to impersonate, if any. |
| For this to work, the service account making the request must have domain-wide |
| delegation enabled. |
| :type delegate_to: str |
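| |
| **Example** (a minimal sketch; the project and cluster names are placeholders): :: |
| |
|     delete_cluster = DataprocClusterDeleteOperator( |
|         task_id='delete_dataproc_cluster', |
|         project_id='my-project', |
|         cluster_name='cluster-1', |
|         dag=dag) |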
| |
| .. attribute:: template_fields |
| :annotation: = ['cluster_name', 'project_id', 'region'] |
| |
| |
| |
| |
| .. method:: start(self) |
| |
| Delete a cluster on Google Cloud Dataproc. |
| |
| |
| |
| |
| .. py:class:: DataProcJobBaseOperator(job_name='{{task.task_id}}_{{ds_nodash}}', cluster_name='cluster-1', dataproc_properties=None, dataproc_jars=None, gcp_conn_id='google_cloud_default', delegate_to=None, labels=None, region='global', job_error_states=None, *args, **kwargs) |
| |
| Bases: :class:`airflow.models.BaseOperator` |
| |
| The base class for operators that launch jobs on DataProc. |
| |
| :param job_name: The job name used in the DataProc cluster. This name by default |
| is the task_id appended with the execution date, but can be templated. The |
| name will always be appended with a random number to avoid name clashes. |
| :type job_name: str |
| :param cluster_name: The name of the DataProc cluster. |
| :type cluster_name: str |
| :param dataproc_properties: Map for the job properties. Ideal to put in |
| default arguments (templated) |
| :type dataproc_properties: dict |
| :param dataproc_jars: HCFS URIs of jar files to add to the CLASSPATH of the job |
| driver and tasks. (templated) |
| :type dataproc_jars: list |
| :param gcp_conn_id: The connection ID to use connecting to Google Cloud Platform. |
| :type gcp_conn_id: str |
| :param delegate_to: The account to impersonate, if any. |
| For this to work, the service account making the request must have domain-wide |
| delegation enabled. |
| :type delegate_to: str |
| :param labels: The labels to associate with this job. Label keys must contain 1 to 63 characters, |
| and must conform to RFC 1035. Label values may be empty, but, if present, must contain 1 to 63 |
| characters, and must conform to RFC 1035. No more than 32 labels can be associated with a job. |
| :type labels: dict |
| :param region: The specified region where the Dataproc cluster is created. |
| :type region: str |
| :param job_error_states: Job states that should be considered error states. |
| Any states in this set will result in an error being raised and failure of the |
| task. Eg, if the ``CANCELLED`` state should also be considered a task failure, |
| pass in ``{'ERROR', 'CANCELLED'}``. Possible values are currently only |
| ``'ERROR'`` and ``'CANCELLED'``, but could change in the future. Defaults to |
| ``{'ERROR'}``. |
| :type job_error_states: set |
| :var dataproc_job_id: The actual "jobId" as submitted to the Dataproc API. |
| This is useful for identifying or linking to the job in the Google Cloud Console |
| Dataproc UI, as the actual "jobId" submitted to the Dataproc API is appended with |
| an 8 character random string. |
| :vartype dataproc_job_id: str |
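| |
| For example, to also treat ``CANCELLED`` as a task failure in a concrete subclass |
| (a sketch using DataProcPigOperator; the query and cluster name are placeholders): :: |
| |
|     t1 = DataProcPigOperator( |
|         task_id='dataproc_pig', |
|         query='a_pig_script.pig', |
|         cluster_name='cluster-1', |
|         job_error_states={'ERROR', 'CANCELLED'}, |
|         dag=dag) |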
| |
| .. attribute:: job_type |
| :annotation: = |
| |
| |
| |
| |
| .. method:: create_job_template(self) |
| |
| Initialize `self.job_template` with default values |
| |
| |
| |
| |
| .. method:: execute(self, context) |
| |
| Build `self.job` based on the job template, and submit it. |
| |
| :raises AirflowException: If no template has been initialized (see create_job_template) |
| |
| |
| |
| |
| .. method:: on_kill(self) |
| |
| Callback called when the operator is killed. |
| Cancel any running job. |
| |
| |
| |
| |
| .. py:class:: DataProcPigOperator(query=None, query_uri=None, variables=None, dataproc_pig_properties=None, dataproc_pig_jars=None, *args, **kwargs) |
| |
| Bases: :class:`airflow.contrib.operators.dataproc_operator.DataProcJobBaseOperator` |
| |
| Start a Pig query Job on a Cloud DataProc cluster. The parameters of the operation |
| will be passed to the cluster. |
| |
| It's good practice to define dataproc_* parameters, such as the cluster name |
| and UDFs, in the default_args of the DAG. |
| |
| .. code-block:: python |
| |
|     default_args = { |
|         'cluster_name': 'cluster-1', |
|         'dataproc_pig_jars': [ |
|             'gs://example/udf/jar/datafu/1.2.0/datafu.jar', |
|             'gs://example/udf/jar/gpig/1.2/gpig.jar' |
|         ] |
|     } |
| |
| You can pass a Pig script as a string or a file reference. Use variables to pass |
| values for the Pig script to be resolved on the cluster, or use the parameters to |
| be resolved in the script as template parameters. |
| |
| **Example**: :: |
| |
|     t1 = DataProcPigOperator( |
|         task_id='dataproc_pig', |
|         query='a_pig_script.pig', |
|         variables={'out': 'gs://example/output/{{ds}}'}, |
|         dag=dag) |
| |
| .. seealso:: |
| For more detail about job submission, have a look at the reference: |
| https://cloud.google.com/dataproc/reference/rest/v1/projects.regions.jobs |
| |
| :param query: The query or reference to the query |
| file (pg or pig extension). (templated) |
| :type query: str |
| :param query_uri: The HCFS URI of the script that contains the Pig queries. |
| :type query_uri: str |
| :param variables: Map of named parameters for the query. (templated) |
| :type variables: dict |
| :param dataproc_pig_properties: Map for the Pig properties. Ideal to put in |
| default arguments (templated) |
| :type dataproc_pig_properties: dict |
| :param dataproc_pig_jars: HCFS URIs of jar files to add to the CLASSPATH of the Pig Client and Hadoop |
| MapReduce (MR) tasks. Can contain Pig UDFs. (templated) |
| :type dataproc_pig_jars: list |
| |
| .. attribute:: template_fields |
| :annotation: = ['query', 'variables', 'job_name', 'cluster_name', 'region', 'dataproc_jars', 'dataproc_properties'] |
| |
| |
| |
| .. attribute:: template_ext |
| :annotation: = ['.pg', '.pig'] |
| |
| |
| |
| .. attribute:: ui_color |
| :annotation: = #0273d4 |
| |
| |
| |
| .. attribute:: job_type |
| :annotation: = pigJob |
| |
| |
| |
| |
| .. method:: execute(self, context) |
| |
| |
| |
| |
| .. py:class:: DataProcHiveOperator(query=None, query_uri=None, variables=None, dataproc_hive_properties=None, dataproc_hive_jars=None, *args, **kwargs) |
| |
| Bases: :class:`airflow.contrib.operators.dataproc_operator.DataProcJobBaseOperator` |
| |
| Start a Hive query Job on a Cloud DataProc cluster. |
| |
| :param query: The query or reference to the query file (q extension). |
| :type query: str |
| :param query_uri: The HCFS URI of the script that contains the Hive queries. |
| :type query_uri: str |
| :param variables: Map of named parameters for the query. |
| :type variables: dict |
| :param dataproc_hive_properties: Map for the Hive properties. Ideal to put in |
| default arguments (templated) |
| :type dataproc_hive_properties: dict |
| :param dataproc_hive_jars: HCFS URIs of jar files to add to the CLASSPATH of the Hive server and Hadoop |
| MapReduce (MR) tasks. Can contain Hive SerDes and UDFs. (templated) |
| :type dataproc_hive_jars: list |
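| |
| **Example** (a minimal sketch; the inline query and cluster name are placeholders): :: |
| |
|     hive_job = DataProcHiveOperator( |
|         task_id='dataproc_hive', |
|         query='SHOW DATABASES;', |
|         cluster_name='cluster-1', |
|         dag=dag) |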
| |
| .. attribute:: template_fields |
| :annotation: = ['query', 'variables', 'job_name', 'cluster_name', 'region', 'dataproc_jars', 'dataproc_properties'] |
| |
| |
| |
| .. attribute:: template_ext |
| :annotation: = ['.q', '.hql'] |
| |
| |
| |
| .. attribute:: ui_color |
| :annotation: = #0273d4 |
| |
| |
| |
| .. attribute:: job_type |
| :annotation: = hiveJob |
| |
| |
| |
| |
| .. method:: execute(self, context) |
| |
| |
| |
| |
| .. py:class:: DataProcSparkSqlOperator(query=None, query_uri=None, variables=None, dataproc_spark_properties=None, dataproc_spark_jars=None, *args, **kwargs) |
| |
| Bases: :class:`airflow.contrib.operators.dataproc_operator.DataProcJobBaseOperator` |
| |
| Start a Spark SQL query Job on a Cloud DataProc cluster. |
| |
| :param query: The query or reference to the query file (q extension). (templated) |
| :type query: str |
| :param query_uri: The HCFS URI of the script that contains the SQL queries. |
| :type query_uri: str |
| :param variables: Map of named parameters for the query. (templated) |
| :type variables: dict |
| :param dataproc_spark_properties: Map for the Spark properties. Ideal to put in |
| default arguments (templated) |
| :type dataproc_spark_properties: dict |
| :param dataproc_spark_jars: HCFS URIs of jar files to be added to the Spark CLASSPATH. (templated) |
| :type dataproc_spark_jars: list |
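| |
| **Example** (a minimal sketch; the inline query and cluster name are placeholders): :: |
| |
|     spark_sql_job = DataProcSparkSqlOperator( |
|         task_id='dataproc_spark_sql', |
|         query='SELECT COUNT(*) FROM my_table;', |
|         cluster_name='cluster-1', |
|         dag=dag) |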
| |
| .. attribute:: template_fields |
| :annotation: = ['query', 'variables', 'job_name', 'cluster_name', 'region', 'dataproc_jars', 'dataproc_properties'] |
| |
| |
| |
| .. attribute:: template_ext |
| :annotation: = ['.q'] |
| |
| |
| |
| .. attribute:: ui_color |
| :annotation: = #0273d4 |
| |
| |
| |
| .. attribute:: job_type |
| :annotation: = sparkSqlJob |
| |
| |
| |
| |
| .. method:: execute(self, context) |
| |
| |
| |
| |
| .. py:class:: DataProcSparkOperator(main_jar=None, main_class=None, arguments=None, archives=None, files=None, dataproc_spark_properties=None, dataproc_spark_jars=None, *args, **kwargs) |
| |
| Bases: :class:`airflow.contrib.operators.dataproc_operator.DataProcJobBaseOperator` |
| |
| Start a Spark Job on a Cloud DataProc cluster. |
| |
| :param main_jar: The HCFS URI of the jar file that contains the main class |
| (use this or the main_class, not both together). |
| :type main_jar: str |
| :param main_class: Name of the job class. (use this or the main_jar, not both |
| together). |
| :type main_class: str |
| :param arguments: Arguments for the job. (templated) |
| :type arguments: list |
| :param archives: List of archived files that will be unpacked in the work |
| directory. Should be stored in Cloud Storage. |
| :type archives: list |
| :param files: List of files to be copied to the working directory |
| :type files: list |
| :param dataproc_spark_properties: Map for the Spark properties. Ideal to put in |
| default arguments (templated) |
| :type dataproc_spark_properties: dict |
| :param dataproc_spark_jars: HCFS URIs of jar files to add to the CLASSPATHs of the |
| Spark driver and tasks. (templated) |
| :type dataproc_spark_jars: list |
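| |
| **Example** (a sketch; the SparkPi example jar path is an assumption about the cluster image, and the cluster name is a placeholder): :: |
| |
|     spark_job = DataProcSparkOperator( |
|         task_id='dataproc_spark', |
|         main_class='org.apache.spark.examples.SparkPi', |
|         dataproc_spark_jars=['file:///usr/lib/spark/examples/jars/spark-examples.jar'], |
|         arguments=['1000'], |
|         cluster_name='cluster-1', |
|         dag=dag) |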
| |
| .. attribute:: template_fields |
| :annotation: = ['arguments', 'job_name', 'cluster_name', 'region', 'dataproc_jars', 'dataproc_properties'] |
| |
| |
| |
| .. attribute:: ui_color |
| :annotation: = #0273d4 |
| |
| |
| |
| .. attribute:: job_type |
| :annotation: = sparkJob |
| |
| |
| |
| |
| .. method:: execute(self, context) |
| |
| |
| |
| |
| .. py:class:: DataProcHadoopOperator(main_jar=None, main_class=None, arguments=None, archives=None, files=None, dataproc_hadoop_properties=None, dataproc_hadoop_jars=None, *args, **kwargs) |
| |
| Bases: :class:`airflow.contrib.operators.dataproc_operator.DataProcJobBaseOperator` |
| |
| Start a Hadoop Job on a Cloud DataProc cluster. |
| |
| :param main_jar: The HCFS URI of the jar file containing the main class |
| (use this or the main_class, not both together). |
| :type main_jar: str |
| :param main_class: Name of the job class. (use this or the main_jar, not both |
| together). |
| :type main_class: str |
| :param arguments: Arguments for the job. (templated) |
| :type arguments: list |
| :param archives: List of archived files that will be unpacked in the work |
| directory. Should be stored in Cloud Storage. |
| :type archives: list |
| :param files: List of files to be copied to the working directory |
| :type files: list |
| :param dataproc_hadoop_properties: Map for the Hadoop properties. Ideal to put in |
| default arguments (templated) |
| :type dataproc_hadoop_properties: dict |
| :param dataproc_hadoop_jars: Jar file URIs to add to the CLASSPATHs of the Hadoop driver and |
| tasks. (templated) |
| :type dataproc_hadoop_jars: list |
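| |
| **Example** (a sketch; the examples jar path is an assumption about the cluster image, and the GCS URIs are placeholders): :: |
| |
|     hadoop_job = DataProcHadoopOperator( |
|         task_id='dataproc_hadoop', |
|         main_jar='file:///usr/lib/hadoop-mapreduce/hadoop-mapreduce-examples.jar', |
|         arguments=['wordcount', 'gs://my-bucket/input/', 'gs://my-bucket/output/'], |
|         cluster_name='cluster-1', |
|         dag=dag) |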
| |
| .. attribute:: template_fields |
| :annotation: = ['arguments', 'job_name', 'cluster_name', 'region', 'dataproc_jars', 'dataproc_properties'] |
| |
| |
| |
| .. attribute:: ui_color |
| :annotation: = #0273d4 |
| |
| |
| |
| .. attribute:: job_type |
| :annotation: = hadoopJob |
| |
| |
| |
| |
| .. method:: execute(self, context) |
| |
| |
| |
| |
| .. py:class:: DataProcPySparkOperator(main, arguments=None, archives=None, pyfiles=None, files=None, dataproc_pyspark_properties=None, dataproc_pyspark_jars=None, *args, **kwargs) |
| |
| Bases: :class:`airflow.contrib.operators.dataproc_operator.DataProcJobBaseOperator` |
| |
| Start a PySpark Job on a Cloud DataProc cluster. |
| |
| :param main: [Required] The Hadoop Compatible Filesystem (HCFS) URI of the main |
| Python file to use as the driver. Must be a .py file. (templated) |
| :type main: str |
| :param arguments: Arguments for the job. (templated) |
| :type arguments: list |
| :param archives: List of archived files that will be unpacked in the work |
| directory. Should be stored in Cloud Storage. |
| :type archives: list |
| :param files: List of files to be copied to the working directory |
| :type files: list |
| :param pyfiles: List of Python files to pass to the PySpark framework. |
| Supported file types: .py, .egg, and .zip |
| :type pyfiles: list |
| :param dataproc_pyspark_properties: Map for the PySpark properties. Ideal to put in |
| default arguments (templated) |
| :type dataproc_pyspark_properties: dict |
| :param dataproc_pyspark_jars: HCFS URIs of jar files to add to the CLASSPATHs of the Python |
| driver and tasks. (templated) |
| :type dataproc_pyspark_jars: list |
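| |
| **Example** (a minimal sketch; the GCS URIs and cluster name are placeholders): :: |
| |
|     pyspark_job = DataProcPySparkOperator( |
|         task_id='dataproc_pyspark', |
|         main='gs://my-bucket/jobs/wordcount.py', |
|         arguments=['gs://my-bucket/input/', 'gs://my-bucket/output/'], |
|         cluster_name='cluster-1', |
|         dag=dag) |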
| |
| .. attribute:: template_fields |
| :annotation: = ['main', 'arguments', 'job_name', 'cluster_name', 'region', 'dataproc_jars', 'dataproc_properties'] |
| |
| |
| |
| .. attribute:: ui_color |
| :annotation: = #0273d4 |
| |
| |
| |
| .. attribute:: job_type |
| :annotation: = pysparkJob |
| |
| |
| |
| |
| .. staticmethod:: _generate_temp_filename(filename) |
| |
| |
| |
| |
| .. method:: _upload_file_temp(self, bucket, local_file) |
| |
| Upload a local file to a Google Cloud Storage bucket. |
| |
| |
| |
| |
| .. method:: execute(self, context) |
| |
| |
| |
| |
| .. py:class:: DataprocWorkflowTemplateInstantiateOperator(template_id, *args, **kwargs) |
| |
| Bases: :class:`airflow.contrib.operators.dataproc_operator.DataprocOperationBaseOperator` |
| |
| Instantiate a WorkflowTemplate on Google Cloud Dataproc. The operator will wait |
| until the WorkflowTemplate is finished executing. |
| |
| .. seealso:: |
| Please refer to: |
| https://cloud.google.com/dataproc/docs/reference/rest/v1beta2/projects.regions.workflowTemplates/instantiate |
| |
| :param template_id: The ID of the template. (templated) |
| :type template_id: str |
| :param project_id: The ID of the Google Cloud project in which |
| the template runs |
| :type project_id: str |
| :param region: leave as 'global'; this might become relevant in the future |
| :type region: str |
| :param gcp_conn_id: The connection ID to use connecting to Google Cloud Platform. |
| :type gcp_conn_id: str |
| :param delegate_to: The account to impersonate, if any. |
| For this to work, the service account making the request must have domain-wide |
| delegation enabled. |
| :type delegate_to: str |
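| |
| **Example** (a minimal sketch; the project and template IDs are placeholders): :: |
| |
|     instantiate_workflow = DataprocWorkflowTemplateInstantiateOperator( |
|         task_id='instantiate_workflow', |
|         project_id='my-project', |
|         template_id='my-workflow-template', |
|         dag=dag) |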
| |
| .. attribute:: template_fields |
| :annotation: = ['template_id'] |
| |
| |
| |
| |
| .. method:: start(self) |
| |
| Instantiate a WorkflowTemplate on Google Cloud Dataproc. |
| |
| |
| |
| |
| .. py:class:: DataprocWorkflowTemplateInstantiateInlineOperator(template, *args, **kwargs) |
| |
| Bases: :class:`airflow.contrib.operators.dataproc_operator.DataprocOperationBaseOperator` |
| |
| Instantiate a WorkflowTemplate Inline on Google Cloud Dataproc. The operator will |
| wait until the WorkflowTemplate is finished executing. |
| |
| .. seealso:: |
| Please refer to: |
| https://cloud.google.com/dataproc/docs/reference/rest/v1beta2/projects.regions.workflowTemplates/instantiateInline |
| |
| :param template: The template contents. (templated) |
| :type template: dict |
| :param project_id: The ID of the Google Cloud project in which |
| the template runs |
| :type project_id: str |
| :param region: leave as 'global'; this might become relevant in the future |
| :type region: str |
| :param gcp_conn_id: The connection ID to use connecting to Google Cloud Platform. |
| :type gcp_conn_id: str |
| :param delegate_to: The account to impersonate, if any. |
| For this to work, the service account making the request must have domain-wide |
| delegation enabled. |
| :type delegate_to: str |
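| |
| **Example** (a sketch; the template body follows the WorkflowTemplate REST resource linked above, and all names and URIs are placeholders): :: |
| |
|     instantiate_inline_workflow = DataprocWorkflowTemplateInstantiateInlineOperator( |
|         task_id='instantiate_inline_workflow', |
|         project_id='my-project', |
|         template={ |
|             'placement': { |
|                 'managedCluster': { |
|                     'clusterName': 'ephemeral-cluster', |
|                     'config': {}, |
|                 }, |
|             }, |
|             'jobs': [{ |
|                 'stepId': 'run_pig_script', |
|                 'pigJob': {'queryFileUri': 'gs://my-bucket/scripts/a_pig_script.pig'}, |
|             }], |
|         }, |
|         dag=dag) |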
| |
| .. attribute:: template_fields |
| :annotation: = ['template'] |
| |
| |
| |
| |
| .. method:: start(self) |
| |
| Instantiate a WorkflowTemplate Inline on Google Cloud Dataproc. |
| |
| |
| |
| |