:py:mod:`airflow.providers.apache.beam.operators.beam`
======================================================
.. py:module:: airflow.providers.apache.beam.operators.beam
.. autoapi-nested-parse::
This module contains Apache Beam operators.
Module Contents
---------------
Classes
~~~~~~~
.. autoapisummary::
airflow.providers.apache.beam.operators.beam.BeamDataflowMixin
airflow.providers.apache.beam.operators.beam.BeamBasePipelineOperator
airflow.providers.apache.beam.operators.beam.BeamRunPythonPipelineOperator
airflow.providers.apache.beam.operators.beam.BeamRunJavaPipelineOperator
airflow.providers.apache.beam.operators.beam.BeamRunGoPipelineOperator
.. py:class:: BeamDataflowMixin
Helper class that stores common, Dataflow-specific logic for
:class:`~airflow.providers.apache.beam.operators.beam.BeamRunPythonPipelineOperator`,
:class:`~airflow.providers.apache.beam.operators.beam.BeamRunJavaPipelineOperator`, and
:class:`~airflow.providers.apache.beam.operators.beam.BeamRunGoPipelineOperator`.
.. py:attribute:: dataflow_hook
:annotation: :Optional[airflow.providers.google.cloud.hooks.dataflow.DataflowHook]
.. py:attribute:: dataflow_config
:annotation: :airflow.providers.google.cloud.operators.dataflow.DataflowConfiguration
.. py:attribute:: gcp_conn_id
:annotation: :str
.. py:attribute:: delegate_to
:annotation: :Optional[str]
.. py:class:: BeamBasePipelineOperator(*, runner = 'DirectRunner', default_pipeline_options = None, pipeline_options = None, gcp_conn_id = 'google_cloud_default', delegate_to = None, dataflow_config = None, **kwargs)
Bases: :py:obj:`airflow.models.BaseOperator`, :py:obj:`BeamDataflowMixin`, :py:obj:`abc.ABC`
Abstract base class for Beam Pipeline Operators.
:param runner: Runner on which the pipeline will be run. By default, "DirectRunner" is used.
Other possible options: DataflowRunner, SparkRunner, FlinkRunner, PortableRunner.
See: :class:`~providers.apache.beam.hooks.beam.BeamRunnerType`
See: https://beam.apache.org/documentation/runners/capability-matrix/
:param default_pipeline_options: Map of default pipeline options.
:param pipeline_options: Map of pipeline options. The parameter must be a dictionary; each key is an
option name and the value can be of different types:
* If the value is None, the single option ``--key`` (without value) will be added.
* If the value is False, this option will be skipped.
* If the value is True, the single option ``--key`` (without value) will be added.
* If the value is a list, the option will be repeated for each element.
If the value is ``['A', 'B']`` and the key is ``key``, then the options ``--key=A --key=B``
will be added.
* Other value types will be replaced with their Python textual representation.
When defining labels (the ``labels`` option), you can also provide a dictionary.
See the sketch after this parameter list for an illustration.
:param gcp_conn_id: Optional.
The connection ID to use when connecting to Google Cloud Storage if the pipeline file is on GCS.
:param delegate_to: Optional.
The account to impersonate using domain-wide delegation of authority,
if any. For this to work, the service account making the request must have
domain-wide delegation enabled.
:param dataflow_config: Optional. Dataflow configuration, used when the runner is set to
DataflowRunner. Defaults to None.
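The following is a minimal, illustrative sketch (not part of the upstream docstring) of how a
``pipeline_options`` dictionary is translated into command-line arguments according to the rules
above; the option names used here are hypothetical examples only.

.. code-block:: python

    # Hypothetical option names, chosen only to demonstrate each translation rule.
    pipeline_options = {
        "project": "my-project",            # plain value: --project=my-project
        "streaming": True,                  # True adds the bare flag: --streaming
        "save_main_session": None,          # None also adds the bare flag: --save_main_session
        "dry_run": False,                   # False: the option is skipped entirely
        "experiments": ["opt_a", "opt_b"],  # list: repeated as --experiments=opt_a --experiments=opt_b
        "num_workers": 2,                   # other types use their textual representation: --num_workers=2
        "labels": {"team": "data"},         # labels may also be provided as a dictionary
    }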
.. py:class:: BeamRunPythonPipelineOperator(*, py_file, runner = 'DirectRunner', default_pipeline_options = None, pipeline_options = None, py_interpreter = 'python3', py_options = None, py_requirements = None, py_system_site_packages = False, gcp_conn_id = 'google_cloud_default', delegate_to = None, dataflow_config = None, **kwargs)
Bases: :py:obj:`BeamBasePipelineOperator`
Launch Apache Beam pipelines written in Python. Note that both
``default_pipeline_options`` and ``pipeline_options`` will be merged to specify pipeline
execution parameters, and ``default_pipeline_options`` is expected to hold
high-level options, for instance project and zone information, which
apply to all Beam operators in the DAG.
.. seealso::
For more information on how to use this operator, take a look at the guide:
:ref:`howto/operator:BeamRunPythonPipelineOperator`
.. seealso::
For more detail on Apache Beam have a look at the reference:
https://beam.apache.org/documentation/
:param py_file: Reference to the Python Apache Beam pipeline file (.py), e.g.,
/some/local/file/path/to/your/python/pipeline/file. (templated)
:param py_options: Additional python options, e.g., ["-m", "-v"].
:param py_interpreter: Python version of the Beam pipeline.
If None, this defaults to python3.
To track Python versions supported by Beam and related
issues, check: https://issues.apache.org/jira/browse/BEAM-1251
:param py_requirements: Additional Python package(s) to install.
If a value is passed to this parameter, a new virtual environment will be created with the
additional packages installed.
You can also install the apache_beam package if it is not installed on your system or if you want
to use a different version.
:param py_system_site_packages: Whether to include system_site_packages in your virtualenv.
See virtualenv documentation for more information.
This option is only relevant if the ``py_requirements`` parameter is not None.
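A minimal usage sketch, assuming a hypothetical local pipeline file and output path; it is not
taken from the provider's example DAGs.

.. code-block:: python

    from datetime import datetime

    from airflow import DAG
    from airflow.providers.apache.beam.operators.beam import BeamRunPythonPipelineOperator

    with DAG(
        dag_id="example_beam_python",  # hypothetical DAG id
        start_date=datetime(2022, 1, 1),
        schedule_interval=None,
        catchup=False,
    ) as dag:
        run_python_pipeline = BeamRunPythonPipelineOperator(
            task_id="run_python_pipeline",
            py_file="/files/wordcount.py",  # hypothetical local pipeline file (templated)
            runner="DirectRunner",  # the default runner
            pipeline_options={"output": "/tmp/wordcount_output"},  # becomes --output=/tmp/wordcount_output
            # A fresh virtualenv will be created with these packages installed;
            # the pinned version is illustrative only.
            py_requirements=["apache-beam==2.40.0"],
            py_interpreter="python3",
            py_system_site_packages=False,
        )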
.. py:attribute:: template_fields
:annotation: :Sequence[str] = ['py_file', 'runner', 'pipeline_options', 'default_pipeline_options', 'dataflow_config']
.. py:attribute:: template_fields_renderers
.. py:attribute:: operator_extra_links
.. py:method:: execute(self, context)
Execute the Apache Beam Pipeline.
.. py:method:: on_kill(self)
Override this method to cleanup subprocesses when a task instance
gets killed. Any use of the threading, subprocess or multiprocessing
module within an operator needs to be cleaned up or it will leave
ghost processes behind.
.. py:class:: BeamRunJavaPipelineOperator(*, jar, runner = 'DirectRunner', job_class = None, default_pipeline_options = None, pipeline_options = None, gcp_conn_id = 'google_cloud_default', delegate_to = None, dataflow_config = None, **kwargs)
Bases: :py:obj:`BeamBasePipelineOperator`
Launch Apache Beam pipelines written in Java.
Note that both
``default_pipeline_options`` and ``pipeline_options`` will be merged to specify pipeline
execution parameters, and ``default_pipeline_options`` is expected to hold
high-level options, for instance project and zone information, which
apply to all Apache Beam operators in the DAG.
.. seealso::
For more information on how to use this operator, take a look at the guide:
:ref:`howto/operator:BeamRunJavaPipelineOperator`
.. seealso::
For more detail on Apache Beam have a look at the reference:
https://beam.apache.org/documentation/
You need to pass the path to your jar file as a file reference with the ``jar``
parameter; the jar needs to be a self-executing jar (see the documentation here:
https://beam.apache.org/documentation/runners/dataflow/#self-executing-jar).
Use ``pipeline_options`` to pass pipeline options to your job.
:param jar: The reference to a self-executing Apache Beam jar (templated).
:param job_class: The name of the Apache Beam pipeline class to be executed, it
is often not the main class configured in the pipeline jar file.
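A minimal usage sketch, assuming a hypothetical self-executing jar and pipeline class; it is not
taken from the provider's example DAGs.

.. code-block:: python

    from datetime import datetime

    from airflow import DAG
    from airflow.providers.apache.beam.operators.beam import BeamRunJavaPipelineOperator

    with DAG(
        dag_id="example_beam_java",  # hypothetical DAG id
        start_date=datetime(2022, 1, 1),
        schedule_interval=None,
        catchup=False,
    ) as dag:
        run_java_pipeline = BeamRunJavaPipelineOperator(
            task_id="run_java_pipeline",
            jar="/files/wordcount-bundled.jar",  # hypothetical self-executing jar (templated)
            job_class="org.example.WordCount",  # hypothetical pipeline class inside the jar
            pipeline_options={"output": "/tmp/wordcount_output"},
        )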
.. py:attribute:: template_fields
:annotation: :Sequence[str] = ['jar', 'runner', 'job_class', 'pipeline_options', 'default_pipeline_options', 'dataflow_config']
.. py:attribute:: template_fields_renderers
.. py:attribute:: ui_color
:annotation: = #0273d4
.. py:attribute:: operator_extra_links
.. py:method:: execute(self, context)
Execute the Apache Beam Pipeline.
.. py:method:: on_kill(self)
Override this method to cleanup subprocesses when a task instance
gets killed. Any use of the threading, subprocess or multiprocessing
module within an operator needs to be cleaned up or it will leave
ghost processes behind.
.. py:class:: BeamRunGoPipelineOperator(*, go_file, runner = 'DirectRunner', default_pipeline_options = None, pipeline_options = None, gcp_conn_id = 'google_cloud_default', delegate_to = None, dataflow_config = None, **kwargs)
Bases: :py:obj:`BeamBasePipelineOperator`
Launch Apache Beam pipelines written in Go. Note that both
``default_pipeline_options`` and ``pipeline_options`` will be merged to specify pipeline
execution parameters, and ``default_pipeline_options`` is expected to hold
high-level options, for instance project and zone information, which
apply to all Beam operators in the DAG.
.. seealso::
For more information on how to use this operator, take a look at the guide:
:ref:`howto/operator:BeamRunGoPipelineOperator`
.. seealso::
For more detail on Apache Beam have a look at the reference:
https://beam.apache.org/documentation/
:param go_file: Reference to the Go Apache Beam pipeline, e.g.,
/some/local/file/path/to/your/go/pipeline/file.go
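A minimal usage sketch, assuming hypothetical file locations and Google Cloud settings; the
``DataflowConfiguration`` arguments shown are illustrative and the example is not taken from the
provider's example DAGs.

.. code-block:: python

    from datetime import datetime

    from airflow import DAG
    from airflow.providers.apache.beam.operators.beam import BeamRunGoPipelineOperator
    from airflow.providers.google.cloud.operators.dataflow import DataflowConfiguration

    with DAG(
        dag_id="example_beam_go",  # hypothetical DAG id
        start_date=datetime(2022, 1, 1),
        schedule_interval=None,
        catchup=False,
    ) as dag:
        # Run locally with the default DirectRunner.
        run_go_pipeline_local = BeamRunGoPipelineOperator(
            task_id="run_go_pipeline_local",
            go_file="/files/wordcount.go",  # hypothetical local pipeline source
            pipeline_options={"output": "/tmp/wordcount_output"},
        )

        # Run on Dataflow, using the optional ``dataflow_config`` parameter.
        run_go_pipeline_dataflow = BeamRunGoPipelineOperator(
            task_id="run_go_pipeline_dataflow",
            go_file="/files/wordcount.go",
            runner="DataflowRunner",
            pipeline_options={"tempLocation": "gs://my-bucket/temp/"},  # hypothetical bucket
            dataflow_config=DataflowConfiguration(
                job_name="example-go-wordcount",  # hypothetical job name
                project_id="my-gcp-project",
                location="us-central1",
            ),
        )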
.. py:attribute:: template_fields
:annotation: = ['go_file', 'runner', 'pipeline_options', 'default_pipeline_options', 'dataflow_config']
.. py:attribute:: template_fields_renderers
.. py:attribute:: operator_extra_links
.. py:method:: execute(self, context)
Execute the Apache Beam Pipeline.
.. py:method:: on_kill(self)
Override this method to cleanup subprocesses when a task instance
gets killed. Any use of the threading, subprocess or multiprocessing
module within an operator needs to be cleaned up or it will leave
ghost processes behind.