:py:mod:`airflow.providers.apache.beam.operators.beam`
======================================================
.. py:module:: airflow.providers.apache.beam.operators.beam
.. autoapi-nested-parse::
This module contains Apache Beam operators.
Module Contents
---------------
Classes
~~~~~~~
.. autoapisummary::
airflow.providers.apache.beam.operators.beam.BeamDataflowMixin
airflow.providers.apache.beam.operators.beam.BeamBasePipelineOperator
airflow.providers.apache.beam.operators.beam.BeamRunPythonPipelineOperator
airflow.providers.apache.beam.operators.beam.BeamRunJavaPipelineOperator
airflow.providers.apache.beam.operators.beam.BeamRunGoPipelineOperator
.. py:class:: BeamDataflowMixin
Helper class that stores common, Dataflow-specific logic for
:class:`~airflow.providers.apache.beam.operators.beam.BeamRunPythonPipelineOperator`,
:class:`~airflow.providers.apache.beam.operators.beam.BeamRunJavaPipelineOperator`, and
:class:`~airflow.providers.apache.beam.operators.beam.BeamRunGoPipelineOperator`.
.. py:attribute:: dataflow_hook
:annotation: :Optional[airflow.providers.google.cloud.hooks.dataflow.DataflowHook]
.. py:attribute:: dataflow_config
:annotation: :airflow.providers.google.cloud.operators.dataflow.DataflowConfiguration
.. py:attribute:: gcp_conn_id
:annotation: :str
.. py:attribute:: delegate_to
:annotation: :Optional[str]
.. py:class:: BeamBasePipelineOperator(*, runner = 'DirectRunner', default_pipeline_options = None, pipeline_options = None, gcp_conn_id = 'google_cloud_default', delegate_to = None, dataflow_config = None, **kwargs)
Bases: :py:obj:`airflow.models.BaseOperator`, :py:obj:`BeamDataflowMixin`, :py:obj:`abc.ABC`
Abstract base class for Beam Pipeline Operators.
:param runner: Runner on which the pipeline will be run. By default, "DirectRunner" is used.
Other possible options: DataflowRunner, SparkRunner, FlinkRunner, PortableRunner.
See: :class:`~providers.apache.beam.hooks.beam.BeamRunnerType`
See: https://beam.apache.org/documentation/runners/capability-matrix/
:param default_pipeline_options: Map of default pipeline options.
:param pipeline_options: Map of pipeline options. The parameter must be a dictionary; each key is an
option name and the value can be of different types:
* If the value is None, the single option ``--key`` (without value) will be added.
* If the value is False, this option will be skipped.
* If the value is True, the single option ``--key`` (without value) will be added.
* If the value is a list, the option will be repeated for each element.
If the value is ``['A', 'B']`` and the key is ``key``, then the options ``--key=A --key=B``
will be added.
* Other value types will be replaced with their Python textual representation.
When defining labels (the ``labels`` option), you can also provide a dictionary.
See the sketch after this parameter list for an illustration.
:param gcp_conn_id: Optional.
The connection ID to use when connecting to Google Cloud Storage if the pipeline file is on GCS.
:param delegate_to: Optional.
The account to impersonate using domain-wide delegation of authority,
if any. For this to work, the service account making the request must have
domain-wide delegation enabled.
:param dataflow_config: Optional. Dataflow configuration, used when the runner is set to
DataflowRunner. Defaults to None.
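The following is a minimal, illustrative sketch (not part of the upstream docstring) of how a
``pipeline_options`` dictionary is translated into command-line arguments according to the rules
above; the option names used here are hypothetical examples only.

.. code-block:: python

    # Hypothetical option names, chosen only to demonstrate each translation rule.
    pipeline_options = {
        "project": "my-project",            # plain value: --project=my-project
        "streaming": True,                  # True adds the bare flag: --streaming
        "save_main_session": None,          # None also adds the bare flag: --save_main_session
        "dry_run": False,                   # False: the option is skipped entirely
        "experiments": ["opt_a", "opt_b"],  # list: repeated as --experiments=opt_a --experiments=opt_b
        "num_workers": 2,                   # other types use their textual representation: --num_workers=2
        "labels": {"team": "data"},         # labels may also be provided as a dictionary
    }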
.. py:class:: BeamRunPythonPipelineOperator(*, py_file, runner = 'DirectRunner', default_pipeline_options = None, pipeline_options = None, py_interpreter = 'python3', py_options = None, py_requirements = None, py_system_site_packages = False, gcp_conn_id = 'google_cloud_default', delegate_to = None, dataflow_config = None, **kwargs)
Bases: :py:obj:`BeamBasePipelineOperator`
Launch Apache Beam pipelines written in Python. Note that both
``default_pipeline_options`` and ``pipeline_options`` will be merged to specify pipeline
execution parameters, and ``default_pipeline_options`` is expected to hold
high-level options, for instance project and zone information, which
apply to all Beam operators in the DAG.
.. seealso::
For more information on how to use this operator, take a look at the guide:
:ref:`howto/operator:BeamRunPythonPipelineOperator`
.. seealso::
For more detail on Apache Beam have a look at the reference:
https://beam.apache.org/documentation/
:param py_file: Reference to the Python Apache Beam pipeline file (.py), e.g.,
/some/local/file/path/to/your/python/pipeline/file. (templated)
:param py_options: Additional python options, e.g., ["-m", "-v"].
:param py_interpreter: Python version of the Beam pipeline.
If None, this defaults to python3.
To track Python versions supported by Beam and related
issues, check: https://issues.apache.org/jira/browse/BEAM-1251
:param py_requirements: Additional Python package(s) to install.
If a value is passed to this parameter, a new virtual environment will be created with the
additional packages installed.
You can also install the apache_beam package if it is not installed on your system or if you want
to use a different version.
:param py_system_site_packages: Whether to include system_site_packages in your virtualenv.
See virtualenv documentation for more information.
This option is only relevant if the ``py_requirements`` parameter is not None.
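A minimal usage sketch, assuming a hypothetical local pipeline file and output path; it is not
taken from the provider's example DAGs.

.. code-block:: python

    from datetime import datetime

    from airflow import DAG
    from airflow.providers.apache.beam.operators.beam import BeamRunPythonPipelineOperator

    with DAG(
        dag_id="example_beam_python",  # hypothetical DAG id
        start_date=datetime(2022, 1, 1),
        schedule_interval=None,
        catchup=False,
    ) as dag:
        run_python_pipeline = BeamRunPythonPipelineOperator(
            task_id="run_python_pipeline",
            py_file="/files/wordcount.py",  # hypothetical local pipeline file (templated)
            runner="DirectRunner",  # the default runner
            pipeline_options={"output": "/tmp/wordcount_output"},  # becomes --output=/tmp/wordcount_output
            # A fresh virtualenv will be created with these packages installed;
            # the pinned version is illustrative only.
            py_requirements=["apache-beam==2.40.0"],
            py_interpreter="python3",
            py_system_site_packages=False,
        )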
.. py:attribute:: template_fields
:annotation: :Sequence[str] = ['py_file', 'runner', 'pipeline_options', 'default_pipeline_options', 'dataflow_config']
.. py:attribute:: template_fields_renderers
.. py:attribute:: operator_extra_links
.. py:method:: execute(self, context)
Execute the Apache Beam Pipeline.
.. py:method:: on_kill(self)
Override this method to cleanup subprocesses when a task instance
gets killed. Any use of the threading, subprocess or multiprocessing
module within an operator needs to be cleaned up or it will leave
ghost processes behind.
.. py:class:: BeamRunJavaPipelineOperator(*, jar, runner = 'DirectRunner', job_class = None, default_pipeline_options = None, pipeline_options = None, gcp_conn_id = 'google_cloud_default', delegate_to = None, dataflow_config = None, **kwargs)
Bases: :py:obj:`BeamBasePipelineOperator`
Launch Apache Beam pipelines written in Java.
Note that both
``default_pipeline_options`` and ``pipeline_options`` will be merged to specify pipeline
execution parameters, and ``default_pipeline_options`` is expected to hold
high-level options, for instance project and zone information, which
apply to all Apache Beam operators in the DAG.
.. seealso::
For more information on how to use this operator, take a look at the guide:
:ref:`howto/operator:BeamRunJavaPipelineOperator`
.. seealso::
For more detail on Apache Beam have a look at the reference:
https://beam.apache.org/documentation/
You need to pass the path to your jar file as a file reference with the ``jar``
parameter; the jar needs to be a self-executing jar (see the documentation here:
https://beam.apache.org/documentation/runners/dataflow/#self-executing-jar).
Use ``pipeline_options`` to pass pipeline options to your job.
:param jar: The reference to a self-executing Apache Beam jar (templated).
:param job_class: The name of the Apache Beam pipeline class to be executed, it
is often not the main class configured in the pipeline jar file.
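A minimal usage sketch, assuming a hypothetical self-executing jar and pipeline class; it is not
taken from the provider's example DAGs.

.. code-block:: python

    from datetime import datetime

    from airflow import DAG
    from airflow.providers.apache.beam.operators.beam import BeamRunJavaPipelineOperator

    with DAG(
        dag_id="example_beam_java",  # hypothetical DAG id
        start_date=datetime(2022, 1, 1),
        schedule_interval=None,
        catchup=False,
    ) as dag:
        run_java_pipeline = BeamRunJavaPipelineOperator(
            task_id="run_java_pipeline",
            jar="/files/wordcount-bundled.jar",  # hypothetical self-executing jar (templated)
            job_class="org.example.WordCount",  # hypothetical pipeline class inside the jar
            pipeline_options={"output": "/tmp/wordcount_output"},
        )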
.. py:attribute:: template_fields
:annotation: :Sequence[str] = ['jar', 'runner', 'job_class', 'pipeline_options', 'default_pipeline_options', 'dataflow_config']
.. py:attribute:: template_fields_renderers
.. py:attribute:: ui_color
:annotation: = #0273d4
.. py:attribute:: operator_extra_links
.. py:method:: execute(self, context)
Execute the Apache Beam Pipeline.
.. py:method:: on_kill(self)
Override this method to cleanup subprocesses when a task instance
gets killed. Any use of the threading, subprocess or multiprocessing
module within an operator needs to be cleaned up or it will leave
ghost processes behind.
.. py:class:: BeamRunGoPipelineOperator(*, go_file, runner = 'DirectRunner', default_pipeline_options = None, pipeline_options = None, gcp_conn_id = 'google_cloud_default', delegate_to = None, dataflow_config = None, **kwargs)
Bases: :py:obj:`BeamBasePipelineOperator`
Launch Apache Beam pipelines written in Go. Note that both
``default_pipeline_options`` and ``pipeline_options`` will be merged to specify pipeline
execution parameters, and ``default_pipeline_options`` is expected to hold
high-level options, for instance project and zone information, which
apply to all Beam operators in the DAG.
.. seealso::
For more information on how to use this operator, take a look at the guide:
:ref:`howto/operator:BeamRunGoPipelineOperator`
.. seealso::
For more detail on Apache Beam have a look at the reference:
https://beam.apache.org/documentation/
:param go_file: Reference to the Go Apache Beam pipeline, e.g.,
/some/local/file/path/to/your/go/pipeline/file.go
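A minimal usage sketch, assuming hypothetical file locations and Google Cloud settings; the
``DataflowConfiguration`` arguments shown are illustrative and the example is not taken from the
provider's example DAGs.

.. code-block:: python

    from datetime import datetime

    from airflow import DAG
    from airflow.providers.apache.beam.operators.beam import BeamRunGoPipelineOperator
    from airflow.providers.google.cloud.operators.dataflow import DataflowConfiguration

    with DAG(
        dag_id="example_beam_go",  # hypothetical DAG id
        start_date=datetime(2022, 1, 1),
        schedule_interval=None,
        catchup=False,
    ) as dag:
        # Run locally with the default DirectRunner.
        run_go_pipeline_local = BeamRunGoPipelineOperator(
            task_id="run_go_pipeline_local",
            go_file="/files/wordcount.go",  # hypothetical local pipeline source
            pipeline_options={"output": "/tmp/wordcount_output"},
        )

        # Run on Dataflow, using the optional ``dataflow_config`` parameter.
        run_go_pipeline_dataflow = BeamRunGoPipelineOperator(
            task_id="run_go_pipeline_dataflow",
            go_file="/files/wordcount.go",
            runner="DataflowRunner",
            pipeline_options={"tempLocation": "gs://my-bucket/temp/"},  # hypothetical bucket
            dataflow_config=DataflowConfiguration(
                job_name="example-go-wordcount",  # hypothetical job name
                project_id="my-gcp-project",
                location="us-central1",
            ),
        )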
.. py:attribute:: template_fields
:annotation: = ['go_file', 'runner', 'pipeline_options', 'default_pipeline_options', 'dataflow_config']
.. py:attribute:: template_fields_renderers
.. py:attribute:: operator_extra_links
.. py:method:: execute(self, context)
Execute the Apache Beam Pipeline.
.. py:method:: on_kill(self)
Override this method to cleanup subprocesses when a task instance
gets killed. Any use of the threading, subprocess or multiprocessing
module within an operator needs to be cleaned up or it will leave
ghost processes behind.