:py:mod:`airflow.providers.google.cloud.hooks.dataflow`
=======================================================

.. py:module:: airflow.providers.google.cloud.hooks.dataflow

.. autoapi-nested-parse::

   This module contains a Google Dataflow Hook.



Module Contents
---------------

Classes
~~~~~~~

.. autoapisummary::

   airflow.providers.google.cloud.hooks.dataflow.DataflowJobStatus
   airflow.providers.google.cloud.hooks.dataflow.DataflowJobType
   airflow.providers.google.cloud.hooks.dataflow.DataflowHook



Functions
~~~~~~~~~

.. autoapisummary::

   airflow.providers.google.cloud.hooks.dataflow.process_line_and_extract_dataflow_job_id_callback



Attributes
~~~~~~~~~~

.. autoapisummary::

   airflow.providers.google.cloud.hooks.dataflow.DEFAULT_DATAFLOW_LOCATION
   airflow.providers.google.cloud.hooks.dataflow.JOB_ID_PATTERN
   airflow.providers.google.cloud.hooks.dataflow.T


.. py:data:: DEFAULT_DATAFLOW_LOCATION
   :annotation: = us-central1


.. py:data:: JOB_ID_PATTERN


.. py:data:: T

.. py:function:: process_line_and_extract_dataflow_job_id_callback(on_new_job_id_callback)

   Returns a callback which triggers the function passed as ``on_new_job_id_callback`` when a Dataflow job ID is found.
   To be used as ``process_line_callback`` in
   :py:class:`~airflow.providers.apache.beam.hooks.beam.BeamCommandRunner`.

   :param on_new_job_id_callback: Callback called when the job ID is known.
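
   A sketch of wiring the callback (the holder dict and the sample log line are
   illustrative assumptions; the line format follows the ``JOB_ID_PATTERN`` this
   module uses to detect job IDs):

   .. code-block:: python

      from airflow.providers.google.cloud.hooks.dataflow import (
          process_line_and_extract_dataflow_job_id_callback,
      )

      captured = {}

      # Returns a callable that inspects one log line at a time.
      process_line_callback = process_line_and_extract_dataflow_job_id_callback(
          on_new_job_id_callback=lambda job_id: captured.update(job_id=job_id)
      )

      # The command runner feeds each output line of the launched process to it:
      process_line_callback("INFO: Submitted job: 2021-01-01_00_00_00-1234567890123456789")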


.. py:class:: DataflowJobStatus

   Helper class with Dataflow job statuses.
   Reference: https://cloud.google.com/dataflow/docs/reference/rest/v1b3/projects.jobs#Job.JobState

   .. py:attribute:: JOB_STATE_DONE
      :annotation: = JOB_STATE_DONE

   .. py:attribute:: JOB_STATE_UNKNOWN
      :annotation: = JOB_STATE_UNKNOWN

   .. py:attribute:: JOB_STATE_STOPPED
      :annotation: = JOB_STATE_STOPPED

   .. py:attribute:: JOB_STATE_RUNNING
      :annotation: = JOB_STATE_RUNNING

   .. py:attribute:: JOB_STATE_FAILED
      :annotation: = JOB_STATE_FAILED

   .. py:attribute:: JOB_STATE_CANCELLED
      :annotation: = JOB_STATE_CANCELLED

   .. py:attribute:: JOB_STATE_UPDATED
      :annotation: = JOB_STATE_UPDATED

   .. py:attribute:: JOB_STATE_DRAINING
      :annotation: = JOB_STATE_DRAINING

   .. py:attribute:: JOB_STATE_DRAINED
      :annotation: = JOB_STATE_DRAINED

   .. py:attribute:: JOB_STATE_PENDING
      :annotation: = JOB_STATE_PENDING

   .. py:attribute:: JOB_STATE_CANCELLING
      :annotation: = JOB_STATE_CANCELLING

   .. py:attribute:: JOB_STATE_QUEUED
      :annotation: = JOB_STATE_QUEUED

   .. py:attribute:: FAILED_END_STATES

   .. py:attribute:: SUCCEEDED_END_STATES

   .. py:attribute:: TERMINAL_STATES

   .. py:attribute:: AWAITING_STATES


.. py:class:: DataflowJobType

   Helper class with Dataflow job types.

   .. py:attribute:: JOB_TYPE_UNKNOWN
      :annotation: = JOB_TYPE_UNKNOWN

   .. py:attribute:: JOB_TYPE_BATCH
      :annotation: = JOB_TYPE_BATCH

   .. py:attribute:: JOB_TYPE_STREAMING
      :annotation: = JOB_TYPE_STREAMING

.. py:class:: DataflowHook(gcp_conn_id = 'google_cloud_default', delegate_to = None, poll_sleep = 10, impersonation_chain = None, drain_pipeline = False, cancel_timeout = 5 * 60, wait_until_finished = None)

   Bases: :py:obj:`airflow.providers.google.common.hooks.base_google.GoogleBaseHook`

   Hook for Google Dataflow.

   All hook methods that accept ``project_id`` must be called with keyword
   arguments rather than positional arguments.
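
   A minimal construction sketch (``google_cloud_default`` is the documented
   default connection ID; the remaining values shown are the documented defaults):

   .. code-block:: python

      from airflow.providers.google.cloud.hooks.dataflow import DataflowHook

      hook = DataflowHook(
          gcp_conn_id="google_cloud_default",
          poll_sleep=10,  # seconds between job-status polls
          drain_pipeline=False,  # cancel rather than drain streaming pipelines when stopping
          cancel_timeout=5 * 60,  # seconds to wait for a successful cancellation
      )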

   .. py:method:: get_conn(self)

      Returns a Google Cloud Dataflow service object.


   .. py:method:: start_java_dataflow(self, job_name, variables, jar, project_id, job_class = None, append_job_name = True, multiple_jobs = False, on_new_job_id_callback = None, location = DEFAULT_DATAFLOW_LOCATION)

      Starts a Dataflow Java job.

      :param job_name: The name of the job.
      :param variables: Variables passed to the job.
      :param jar: Name of the jar for the job.
      :param project_id: Optional, the Google Cloud project ID in which to start a job.
         If set to None or missing, the default project_id from the Google Cloud connection is used.
      :param job_class: Name of the Java class for the job.
      :param append_job_name: True if a unique suffix has to be appended to the job name.
      :param multiple_jobs: True if the hook should check for multiple jobs in Dataflow.
      :param on_new_job_id_callback: Callback called when the job ID is known.
      :param location: Job location.
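
      A sketch of a typical call, assuming ``hook`` is a :py:class:`DataflowHook`
      instance; the jar path, job name, class, and project are illustrative placeholders:

      .. code-block:: python

         hook.start_java_dataflow(
             job_name="example-wordcount",
             variables={"output": "gs://example-bucket/wordcount/output"},
             jar="/files/dataflow-wordcount.jar",
             project_id="example-project",
             job_class="org.example.WordCount",
             location="us-central1",
         )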

   .. py:method:: start_template_dataflow(self, job_name, variables, parameters, dataflow_template, project_id, append_job_name = True, on_new_job_id_callback = None, on_new_job_callback = None, location = DEFAULT_DATAFLOW_LOCATION, environment = None)

      Starts a Dataflow template job.

      :param job_name: The name of the job.
      :param variables: Map of job runtime environment options.
         It updates the ``environment`` argument if both are passed.

         .. seealso::
            For more information on possible configurations, look at the API documentation for
            `RuntimeEnvironment
            <https://cloud.google.com/dataflow/docs/reference/rest/v1b3/RuntimeEnvironment>`__

      :param parameters: Parameters for the template.
      :param dataflow_template: GCS path to the template.
      :param project_id: Optional, the Google Cloud project ID in which to start a job.
         If set to None or missing, the default project_id from the Google Cloud connection is used.
      :param append_job_name: True if a unique suffix has to be appended to the job name.
      :param on_new_job_id_callback: (Deprecated) Callback called when the job ID is known.
      :param on_new_job_callback: Callback called when the job is known.
      :param location: Job location.
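
      A sketch of a typical call, assuming ``hook`` is a :py:class:`DataflowHook`
      instance; the bucket, project, and template path are illustrative placeholders:

      .. code-block:: python

         hook.start_template_dataflow(
             job_name="example-template-job",
             variables={"tempLocation": "gs://example-bucket/tmp"},
             parameters={
                 "inputFile": "gs://dataflow-samples/shakespeare/kinglear.txt",
                 "output": "gs://example-bucket/wordcount/output",
             },
             dataflow_template="gs://dataflow-templates/latest/Word_Count",
             project_id="example-project",
             location="us-central1",
         )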

   .. py:method:: start_flex_template(self, body, location, project_id, on_new_job_id_callback = None, on_new_job_callback = None)

      Starts a Dataflow Flex Template job.

      :param body: The request body. See:
         https://cloud.google.com/dataflow/docs/reference/rest/v1b3/projects.locations.flexTemplates/launch#request-body
      :param location: The location of the Dataflow job (for example europe-west1)
      :param project_id: The ID of the GCP project that owns the job.
         If set to ``None`` or missing, the default project_id from the GCP connection is used.
      :param on_new_job_id_callback: (Deprecated) A callback that is called when a Job ID is detected.
      :param on_new_job_callback: A callback that is called when a Job is detected.
      :return: the Job
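
      A sketch of a launch request, assuming ``hook`` is a :py:class:`DataflowHook`
      instance; the job name, container spec path, and parameters are illustrative
      placeholders shaped after the ``flexTemplates.launch`` request body linked above:

      .. code-block:: python

         job = hook.start_flex_template(
             body={
                 "launchParameter": {
                     "jobName": "example-flex-job",
                     "containerSpecGcsPath": "gs://example-bucket/templates/streaming-beam.json",
                     "parameters": {"input_topic": "projects/example-project/topics/example"},
                 }
             },
             location="europe-west1",
             project_id="example-project",
         )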

   .. py:method:: start_python_dataflow(self, job_name, variables, dataflow, py_options, project_id, py_interpreter = 'python3', py_requirements = None, py_system_site_packages = False, append_job_name = True, on_new_job_id_callback = None, location = DEFAULT_DATAFLOW_LOCATION)

      Starts a Dataflow Python job.

      :param job_name: The name of the job.
      :param variables: Variables passed to the job.
      :param dataflow: Name of the Dataflow process.
      :param py_options: Additional options.
      :param project_id: Optional, the Google Cloud project ID in which to start a job.
         If set to None or missing, the default project_id from the Google Cloud connection is used.
      :param py_interpreter: Python version of the Beam pipeline.
         If None, this defaults to ``python3``.
         To track Python versions supported by Beam and related
         issues check: https://issues.apache.org/jira/browse/BEAM-1251
      :param py_requirements: Additional Python package(s) to install.
         If a value is passed to this parameter, a new virtual environment will be created with the
         additional packages installed.

         You can also install the apache-beam package if it is not installed on your system, or if
         you want to use a different version.
      :param py_system_site_packages: Whether to include system_site_packages in your virtualenv.
         See virtualenv documentation for more information.

         This option is only relevant if the ``py_requirements`` parameter is not None.
      :param append_job_name: True if a unique suffix has to be appended to the job name.
      :param on_new_job_id_callback: Callback called when the job ID is known.
      :param location: Job location.
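
      A sketch of a typical call, assuming ``hook`` is a :py:class:`DataflowHook`
      instance; the pipeline file, bucket, and project are illustrative placeholders:

      .. code-block:: python

         hook.start_python_dataflow(
             job_name="example-python-job",
             variables={"temp_location": "gs://example-bucket/tmp"},
             dataflow="/files/pipelines/wordcount.py",
             py_options=[],
             project_id="example-project",
             py_interpreter="python3",
             py_requirements=["apache-beam[gcp]"],
         )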

   .. py:method:: build_dataflow_job_name(job_name, append_job_name = True)
      :staticmethod:

      Builds a Dataflow job name.


   .. py:method:: is_job_dataflow_running(self, name, project_id, location = DEFAULT_DATAFLOW_LOCATION, variables = None)

      Helper method to check whether the job is still running in Dataflow.

      :param name: The name of the job.
      :param project_id: Optional, the Google Cloud project ID in which to look for the job.
         If set to None or missing, the default project_id from the Google Cloud connection is used.
      :param location: Job location.
      :return: True if the job is running.
      :rtype: bool
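
      A sketch, assuming ``hook`` is a :py:class:`DataflowHook` instance; the job
      name and project are illustrative placeholders:

      .. code-block:: python

         running = hook.is_job_dataflow_running(
             name="example-python-job",
             project_id="example-project",
             location="us-central1",
         )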

   .. py:method:: cancel_job(self, project_id, job_name = None, job_id = None, location = DEFAULT_DATAFLOW_LOCATION)

      Cancels the job with the specified name prefix or Job ID.

      Parameters ``job_name`` and ``job_id`` are mutually exclusive.

      :param job_name: Name prefix specifying which jobs are to be canceled.
      :param job_id: Job ID specifying which jobs are to be canceled.
      :param location: Job location.
      :param project_id: Optional, the Google Cloud project ID that owns the job.
         If set to None or missing, the default project_id from the Google Cloud connection is used.
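
      A sketch of cancelling by Job ID, assuming ``hook`` is a
      :py:class:`DataflowHook` instance; the ID and project are illustrative placeholders:

      .. code-block:: python

         hook.cancel_job(
             project_id="example-project",
             job_id="2021-01-01_00_00_00-1234567890123456789",
             location="us-central1",
         )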

   .. py:method:: start_sql_job(self, job_name, query, options, project_id, location = DEFAULT_DATAFLOW_LOCATION, on_new_job_id_callback = None, on_new_job_callback = None)

      Starts a Dataflow SQL query.

      :param job_name: The unique name to assign to the Cloud Dataflow job.
      :param query: The SQL query to execute.
      :param options: Job parameters to be executed. For more information, look at the
         `gcloud beta dataflow sql query
         <https://cloud.google.com/sdk/gcloud/reference/beta/dataflow/sql/query>`__
         command reference.
      :param location: The location of the Dataflow job (for example europe-west1)
      :param project_id: The ID of the GCP project that owns the job.
         If set to ``None`` or missing, the default project_id from the GCP connection is used.
      :param on_new_job_id_callback: (Deprecated) Callback called when the job ID is known.
      :param on_new_job_callback: Callback called when the job is known.
      :return: the new job object
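
      A sketch, assuming ``hook`` is a :py:class:`DataflowHook` instance; the project,
      dataset, and table names are illustrative placeholders, and the ``options`` keys
      are assumed to mirror ``gcloud beta dataflow sql query`` flags:

      .. code-block:: python

         job = hook.start_sql_job(
             job_name="example-sql-job",
             query="""
                 SELECT sales_region, COUNT(*) AS num_sales
                 FROM bigquery.table.`example-project`.`example_dataset`.`sales`
                 GROUP BY sales_region
             """,
             options={
                 "bigquery-project": "example-project",
                 "bigquery-dataset": "example_dataset",
                 "bigquery-table": "sql_output",
             },
             project_id="example-project",
             location="us-central1",
         )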

   .. py:method:: get_job(self, job_id, project_id, location = DEFAULT_DATAFLOW_LOCATION)

      Gets the job with the specified Job ID.

      :param job_id: Job ID to get.
      :param project_id: Optional, the Google Cloud project ID that owns the job.
         If set to None or missing, the default project_id from the Google Cloud connection is used.
      :param location: The location of the Dataflow job (for example europe-west1). See:
         https://cloud.google.com/dataflow/docs/concepts/regional-endpoints
      :return: the Job
      :rtype: dict
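
      A sketch, assuming ``hook`` is a :py:class:`DataflowHook` instance; the Job ID
      and project are illustrative placeholders:

      .. code-block:: python

         job = hook.get_job(
             job_id="2021-01-01_00_00_00-1234567890123456789",
             project_id="example-project",
             location="us-central1",
         )
         print(job["currentState"])  # e.g. "JOB_STATE_RUNNING"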

   .. py:method:: fetch_job_metrics_by_id(self, job_id, project_id, location = DEFAULT_DATAFLOW_LOCATION)

      Gets the job metrics with the specified Job ID.

      :param job_id: Job ID to get.
      :param project_id: Optional, the Google Cloud project ID that owns the job.
         If set to None or missing, the default project_id from the Google Cloud connection is used.
      :param location: The location of the Dataflow job (for example europe-west1). See:
         https://cloud.google.com/dataflow/docs/concepts/regional-endpoints
      :return: the JobMetrics. See:
         https://cloud.google.com/dataflow/docs/reference/rest/v1b3/JobMetrics
      :rtype: dict
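
      A sketch, assuming ``hook`` is a :py:class:`DataflowHook` instance; the Job ID
      and project are illustrative placeholders, and the ``metrics`` key is assumed to
      follow the ``JobMetrics`` resource linked above:

      .. code-block:: python

         metrics = hook.fetch_job_metrics_by_id(
             job_id="2021-01-01_00_00_00-1234567890123456789",
             project_id="example-project",
         )
         for metric in metrics.get("metrics", []):
             print(metric["name"], metric.get("scalar"))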

   .. py:method:: fetch_job_messages_by_id(self, job_id, project_id, location = DEFAULT_DATAFLOW_LOCATION)

      Gets the job messages with the specified Job ID.

      :param job_id: Job ID to get.
      :param project_id: Optional, the Google Cloud project ID that owns the job.
         If set to None or missing, the default project_id from the Google Cloud connection is used.
      :param location: Job location.
      :return: the list of JobMessages. See:
         https://cloud.google.com/dataflow/docs/reference/rest/v1b3/ListJobMessagesResponse#JobMessage
      :rtype: List[dict]

   .. py:method:: fetch_job_autoscaling_events_by_id(self, job_id, project_id, location = DEFAULT_DATAFLOW_LOCATION)

      Gets the job autoscaling events with the specified Job ID.

      :param job_id: Job ID to get.
      :param project_id: Optional, the Google Cloud project ID that owns the job.
         If set to None or missing, the default project_id from the Google Cloud connection is used.
      :param location: Job location.
      :return: the list of AutoscalingEvents. See:
         https://cloud.google.com/dataflow/docs/reference/rest/v1b3/ListJobMessagesResponse#autoscalingevent
      :rtype: List[dict]

   .. py:method:: wait_for_done(self, job_name, location, project_id, job_id = None, multiple_jobs = False)

      Waits for a Dataflow job to finish.

      :param job_name: The 'jobName' to use when executing the Dataflow job
         (templated). This ends up being set in the pipeline options, so any entry
         with key ``'jobName'`` in ``options`` will be overwritten.
      :param location: The location where the job is running.
      :param project_id: Optional, the Google Cloud project ID in which to start a job.
         If set to None or missing, the default project_id from the Google Cloud connection is used.
      :param job_id: A Dataflow job ID.
      :param multiple_jobs: If the pipeline creates multiple jobs, then monitor all of them.
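
      A sketch, assuming ``hook`` is a :py:class:`DataflowHook` instance; the names
      and IDs are illustrative placeholders:

      .. code-block:: python

         hook.wait_for_done(
             job_name="example-python-job",
             location="us-central1",
             project_id="example-project",
             job_id="2021-01-01_00_00_00-1234567890123456789",
         )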