:mod:`airflow.providers.apache.beam.operators.beam`
===================================================
.. py:module:: airflow.providers.apache.beam.operators.beam
.. autoapi-nested-parse::
This module contains Apache Beam operators.
Module Contents
---------------
.. py:class:: BeamDataflowMixin
Helper class to store common, Dataflow-specific logic for both
:class:`~airflow.providers.apache.beam.operators.beam.BeamRunPythonPipelineOperator` and
:class:`~airflow.providers.apache.beam.operators.beam.BeamRunJavaPipelineOperator`.
.. attribute:: dataflow_hook
:annotation: :Optional[DataflowHook]
.. attribute:: dataflow_config
:annotation: :Optional[DataflowConfiguration]
.. method:: _set_dataflow(self, pipeline_options: dict, job_name_variable_key: Optional[str] = None)
.. method:: __set_dataflow_hook(self)
.. method:: __get_dataflow_job_name(self)
.. method:: __get_dataflow_pipeline_options(self, pipeline_options: dict, job_name: str, job_name_key: Optional[str] = None)
.. method:: __get_dataflow_process_callback(self)
.. py:class:: BeamRunPythonPipelineOperator(*, py_file: str, runner: str = 'DirectRunner', default_pipeline_options: Optional[dict] = None, pipeline_options: Optional[dict] = None, py_interpreter: str = 'python3', py_options: Optional[List[str]] = None, py_requirements: Optional[List[str]] = None, py_system_site_packages: bool = False, gcp_conn_id: str = 'google_cloud_default', delegate_to: Optional[str] = None, dataflow_config: Optional[Union[DataflowConfiguration, dict]] = None, **kwargs)
Bases: :class:`airflow.models.BaseOperator`, :class:`airflow.providers.apache.beam.operators.beam.BeamDataflowMixin`
Launching Apache Beam pipelines written in Python. Note that both
``default_pipeline_options`` and ``pipeline_options`` will be merged to specify pipeline
execution parameters, and ``default_pipeline_options`` is expected to hold
high-level options, for instance, project and zone information, which
apply to all Beam operators in the DAG.
.. seealso::
For more information on how to use this operator, take a look at the guide:
:ref:`howto/operator:BeamRunPythonPipelineOperator`
.. seealso::
For more detail on Apache Beam have a look at the reference:
https://beam.apache.org/documentation/
:param py_file: Reference to the python Apache Beam pipeline file.py, e.g.,
/some/local/file/path/to/your/python/pipeline/file. (templated)
:type py_file: str
:param runner: Runner on which the pipeline will be run. By default "DirectRunner" is used.
Other possible options: DataflowRunner, SparkRunner, FlinkRunner.
See: :class:`~providers.apache.beam.hooks.beam.BeamRunnerType`
See: https://beam.apache.org/documentation/runners/capability-matrix/
:type runner: str
:param py_options: Additional python options, e.g., ["-m", "-v"].
:type py_options: list[str]
:param default_pipeline_options: Map of default pipeline options.
:type default_pipeline_options: dict
:param pipeline_options: Map of pipeline options. The parameter must be a dictionary.
The values can be of different types (see the sketch after this parameter list):
* If the value is None, the single option ``--key`` (without a value) will be added.
* If the value is False, this option will be skipped.
* If the value is True, the single option ``--key`` (without a value) will be added.
* If the value is a list, the option will be repeated for each element.
If the value is ``['A', 'B']`` and the key is ``key`` then the options ``--key=A --key=B``
will be passed.
* Other value types will be replaced with their Python textual representation.
When defining labels (``labels`` option), you can also provide a dictionary.
:type pipeline_options: dict
:param py_interpreter: Python version of the beam pipeline.
If None, this defaults to ``python3``.
To track python versions supported by beam and related
issues check: https://issues.apache.org/jira/browse/BEAM-1251
:type py_interpreter: str
:param py_requirements: Additional python package(s) to install.
If a value is passed to this parameter, a new virtual environment will be created with the
additional packages installed.
You can also install the apache-beam package if it is not installed on your system, or if you
want to use a different version.
:type py_requirements: List[str]
:param py_system_site_packages: Whether to include system_site_packages in your virtualenv.
See virtualenv documentation for more information.
This option is only relevant if the ``py_requirements`` parameter is not None.
:type py_system_site_packages: bool
:param gcp_conn_id: Optional.
The connection ID to use when connecting to Google Cloud Storage if the Python file is on GCS.
:type gcp_conn_id: str
:param delegate_to: Optional.
The account to impersonate using domain-wide delegation of authority,
if any. For this to work, the service account making the request must have
domain-wide delegation enabled.
:type delegate_to: str
:param dataflow_config: Dataflow configuration, used when runner type is set to DataflowRunner
:type dataflow_config: Union[dict, providers.google.cloud.operators.dataflow.DataflowConfiguration]
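The ``pipeline_options`` mapping described above can be illustrated with a plain dictionary; the option names and values below are placeholders, and the comments show the command-line flags the operator would be expected to build from them:

.. code-block:: python

    # Illustrative only: placeholder option names/values and the flags they
    # translate to, following the rules listed above.
    pipeline_options = {
        "streaming": True,                    # -> --streaming (flag only)
        "no_use_public_ips": None,            # -> --no_use_public_ips (flag only)
        "save_main_session": False,           # -> skipped entirely
        "experiments": ["opt_a", "opt_b"],    # -> --experiments=opt_a --experiments=opt_b
        "temp_location": "gs://bucket/tmp/",  # -> --temp_location=gs://bucket/tmp/
        "labels": {"team": "data"},           # labels may also be given as a dictionary
    }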
.. attribute:: template_fields
:annotation: = ['py_file', 'runner', 'pipeline_options', 'default_pipeline_options', 'dataflow_config']
.. attribute:: template_fields_renderers
.. method:: execute(self, context)
Execute the Apache Beam Pipeline.
.. method:: on_kill(self)
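A minimal usage sketch of ``BeamRunPythonPipelineOperator`` inside a DAG; the file path, option values, and package pin are placeholders, not values required by the operator:

.. code-block:: python

    from airflow.providers.apache.beam.operators.beam import BeamRunPythonPipelineOperator

    # Placeholder paths and options; runs the pipeline locally with the DirectRunner.
    start_python_pipeline = BeamRunPythonPipelineOperator(
        task_id="start_python_pipeline",
        py_file="/files/pipelines/wordcount.py",  # local path or gs:// URI
        runner="DirectRunner",
        pipeline_options={"output": "/tmp/wordcount_output"},
        py_requirements=["apache-beam[gcp]==2.26.0"],  # triggers creation of a fresh virtualenv
        py_interpreter="python3",
        py_system_site_packages=False,
    )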
.. py:class:: BeamRunJavaPipelineOperator(*, jar: str, runner: str = 'DirectRunner', job_class: Optional[str] = None, default_pipeline_options: Optional[dict] = None, pipeline_options: Optional[dict] = None, gcp_conn_id: str = 'google_cloud_default', delegate_to: Optional[str] = None, dataflow_config: Optional[Union[DataflowConfiguration, dict]] = None, **kwargs)
Bases: :class:`airflow.models.BaseOperator`, :class:`airflow.providers.apache.beam.operators.beam.BeamDataflowMixin`
Launching Apache Beam pipelines written in Java.
Note that both
``default_pipeline_options`` and ``pipeline_options`` will be merged to specify pipeline
execution parameters, and ``default_pipeline_options`` is expected to hold
high-level pipeline options, for instance, project and zone information, which
apply to all Apache Beam operators in the DAG.
.. seealso::
For more information on how to use this operator, take a look at the guide:
:ref:`howto/operator:BeamRunJavaPipelineOperator`
.. seealso::
For more detail on Apache Beam have a look at the reference:
https://beam.apache.org/documentation/
You need to pass the path to your jar file as a file reference with the ``jar``
parameter; the jar needs to be a self-executing jar (see documentation here:
https://beam.apache.org/documentation/runners/dataflow/#self-executing-jar).
Use ``pipeline_options`` to pass pipeline options to your job.
:param jar: The reference to a self-executing Apache Beam jar (templated).
:type jar: str
:param runner: Runner on which the pipeline will be run. By default "DirectRunner" is used.
See:
https://beam.apache.org/documentation/runners/capability-matrix/
:type runner: str
:param job_class: The name of the Apache Beam pipeline class to be executed; it
is often not the main class configured in the pipeline jar file.
:type job_class: str
:param default_pipeline_options: Map of default pipeline options.
:type default_pipeline_options: dict
:param pipeline_options: Map of job-specific pipeline options. The parameter must be a dictionary.
The values can be of different types:
* If the value is None, the single option ``--key`` (without a value) will be added.
* If the value is False, this option will be skipped.
* If the value is True, the single option ``--key`` (without a value) will be added.
* If the value is a list, the option will be repeated for each element.
If the value is ``['A', 'B']`` and the key is ``key`` then the options ``--key=A --key=B``
will be passed.
* Other value types will be replaced with their Python textual representation.
When defining labels (``labels`` option), you can also provide a dictionary.
:type pipeline_options: dict
:param gcp_conn_id: The connection ID to use when connecting to Google Cloud Storage if the jar is on GCS.
:type gcp_conn_id: str
:param delegate_to: The account to impersonate using domain-wide delegation of authority,
if any. For this to work, the service account making the request must have
domain-wide delegation enabled.
:type delegate_to: str
:param dataflow_config: Dataflow configuration, used when runner type is set to DataflowRunner
:type dataflow_config: Union[dict, providers.google.cloud.operators.dataflow.DataflowConfiguration]
.. attribute:: template_fields
:annotation: = ['jar', 'runner', 'job_class', 'pipeline_options', 'default_pipeline_options', 'dataflow_config']
.. attribute:: template_fields_renderers
.. attribute:: ui_color
:annotation: = #0273d4
.. method:: execute(self, context)
Execute the Apache Beam Pipeline.
.. method:: on_kill(self)
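A minimal usage sketch of ``BeamRunJavaPipelineOperator`` targeting Dataflow; the jar URI, job class, and Dataflow settings are placeholders, and ``DataflowConfiguration`` is assumed to come from the Google provider module referenced above:

.. code-block:: python

    from airflow.providers.apache.beam.operators.beam import BeamRunJavaPipelineOperator
    from airflow.providers.google.cloud.operators.dataflow import DataflowConfiguration

    # Placeholder jar, job class, and Dataflow settings; the jar must be self-executing.
    start_java_pipeline = BeamRunJavaPipelineOperator(
        task_id="start_java_pipeline",
        jar="gs://my-bucket/pipelines/wordcount-bundled.jar",
        job_class="org.example.WordCount",
        runner="DataflowRunner",
        pipeline_options={"tempLocation": "gs://my-bucket/tmp/"},
        dataflow_config=DataflowConfiguration(job_name="java_wordcount", location="us-central1"),
    )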