:py:mod:`airflow.providers.apache.beam.operators.beam`
======================================================

.. py:module:: airflow.providers.apache.beam.operators.beam

.. autoapi-nested-parse::

   This module contains Apache Beam operators.



Module Contents
---------------

Classes
~~~~~~~

.. autoapisummary::

   airflow.providers.apache.beam.operators.beam.BeamDataflowMixin
   airflow.providers.apache.beam.operators.beam.BeamBasePipelineOperator
   airflow.providers.apache.beam.operators.beam.BeamRunPythonPipelineOperator
   airflow.providers.apache.beam.operators.beam.BeamRunJavaPipelineOperator
   airflow.providers.apache.beam.operators.beam.BeamRunGoPipelineOperator



.. py:class:: BeamDataflowMixin

   Helper class to store common, Dataflow-specific logic for the
   :class:`~airflow.providers.apache.beam.operators.beam.BeamRunPythonPipelineOperator`,
   :class:`~airflow.providers.apache.beam.operators.beam.BeamRunJavaPipelineOperator`, and
   :class:`~airflow.providers.apache.beam.operators.beam.BeamRunGoPipelineOperator` operators.

   .. py:attribute:: dataflow_hook
      :annotation: :Optional[airflow.providers.google.cloud.hooks.dataflow.DataflowHook]


   .. py:attribute:: dataflow_config
      :annotation: :airflow.providers.google.cloud.operators.dataflow.DataflowConfiguration


   .. py:attribute:: gcp_conn_id
      :annotation: :str


   .. py:attribute:: delegate_to
      :annotation: :Optional[str]


.. py:class:: BeamBasePipelineOperator(*, runner = 'DirectRunner', default_pipeline_options = None, pipeline_options = None, gcp_conn_id = 'google_cloud_default', delegate_to = None, dataflow_config = None, **kwargs)

   Bases: :py:obj:`airflow.models.BaseOperator`, :py:obj:`BeamDataflowMixin`, :py:obj:`abc.ABC`

   Abstract base class for Beam Pipeline Operators.

   :param runner: Runner on which the pipeline will be run. Defaults to "DirectRunner".
      Other possible options: DataflowRunner, SparkRunner, FlinkRunner, PortableRunner.
      See: :class:`~providers.apache.beam.hooks.beam.BeamRunnerType`
      See: https://beam.apache.org/documentation/runners/capability-matrix/

   :param default_pipeline_options: Map of default pipeline options.
   :param pipeline_options: Map of pipeline options. The parameter must be a dictionary;
      each key is an option name, and the value can be of different types
      (see the sketch after this list for how values are rendered):

      * If the value is None, the single option ``--key`` (without a value) will be added.
      * If the value is False, the option will be skipped.
      * If the value is True, the single option ``--key`` (without a value) will be added.
      * If the value is a list, the option will be repeated for each element. For example,
        the value ``['A', 'B']`` with the key ``key`` produces ``--key=A --key=B``.
      * Other value types will be rendered using their Python textual representation.

      When defining labels (the ``labels`` option), you can also provide a dictionary.
   :param gcp_conn_id: Optional.
      The connection ID to use when connecting to Google Cloud Storage if the Python file is on GCS.
   :param delegate_to: Optional.
      The account to impersonate using domain-wide delegation of authority,
      if any. For this to work, the service account making the request must have
      domain-wide delegation enabled.
   :param dataflow_config: Optional. Dataflow configuration, used when the runner is set to
      DataflowRunner. Defaults to None.
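
   The following is a minimal illustrative sketch, not part of this module, of how
   the rules above map a ``pipeline_options`` dictionary onto command-line flags;
   the helper name ``options_to_args`` is hypothetical.

   .. code-block:: python

      def options_to_args(options):
          """Render a pipeline-options dict as CLI flags, per the rules above."""
          args = []
          for key, value in options.items():
              if value is None or value is True:
                  args.append(f"--{key}")  # bare flag without a value
              elif value is False:
                  continue  # option skipped entirely
              elif isinstance(value, list):
                  args.extend(f"--{key}={item}" for item in value)  # one flag per element
              else:
                  args.append(f"--{key}={value}")  # Python textual representation
          return args


      options_to_args({"project": "my-project", "streaming": True, "labels": ["a", "b"], "dry_run": False})
      # -> ['--project=my-project', '--streaming', '--labels=a', '--labels=b']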


.. py:class:: BeamRunPythonPipelineOperator(*, py_file, runner = 'DirectRunner', default_pipeline_options = None, pipeline_options = None, py_interpreter = 'python3', py_options = None, py_requirements = None, py_system_site_packages = False, gcp_conn_id = 'google_cloud_default', delegate_to = None, dataflow_config = None, **kwargs)

   Bases: :py:obj:`BeamBasePipelineOperator`

   Launches Apache Beam pipelines written in Python. Note that both
   ``default_pipeline_options`` and ``pipeline_options`` will be merged to specify pipeline
   execution parameters, and ``default_pipeline_options`` is expected to hold
   high-level options, for instance project and zone information, which
   apply to all Beam operators in the DAG.

   .. seealso::
      For more information on how to use this operator, take a look at the guide:
      :ref:`howto/operator:BeamRunPythonPipelineOperator`

   .. seealso::
      For more detail on Apache Beam have a look at the reference:
      https://beam.apache.org/documentation/

   :param py_file: Reference to the Python Apache Beam pipeline file, e.g.,
      /some/local/file/path/to/your/python/pipeline/file.py. (templated)
   :param py_options: Additional Python options, e.g., ["-m", "-v"].
   :param py_interpreter: Python version of the Beam pipeline.
      If None, this defaults to python3.
      To track Python versions supported by Beam and related
      issues check: https://issues.apache.org/jira/browse/BEAM-1251
   :param py_requirements: Additional Python package(s) to install.
      If a value is passed to this parameter, a new virtual environment will be created
      with the additional packages installed.

      You could also install the apache_beam package if it is not installed on your
      system, or if you want to use a different version.
   :param py_system_site_packages: Whether to include system_site_packages in your virtualenv.
      See the virtualenv documentation for more information.

      This option is only relevant if the ``py_requirements`` parameter is not None.
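
   A minimal usage sketch; the DAG id, file paths, and requirement spec below are
   illustrative assumptions, not values taken from this module.

   .. code-block:: python

      from datetime import datetime

      from airflow import DAG
      from airflow.providers.apache.beam.operators.beam import BeamRunPythonPipelineOperator

      with DAG(
          dag_id="example_beam_python",  # hypothetical DAG id
          start_date=datetime(2023, 1, 1),
          schedule_interval=None,
          catchup=False,
      ):
          run_python_pipeline = BeamRunPythonPipelineOperator(
              task_id="run_python_pipeline",
              py_file="/files/pipelines/wordcount.py",  # hypothetical local path
              runner="DirectRunner",
              pipeline_options={"output": "/files/output/results"},
              # A non-None py_requirements triggers creation of a fresh virtualenv:
              py_requirements=["apache-beam[gcp]"],
              py_system_site_packages=False,
          )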

   .. py:attribute:: template_fields
      :annotation: :Sequence[str] = ['py_file', 'runner', 'pipeline_options', 'default_pipeline_options', 'dataflow_config']


   .. py:attribute:: template_fields_renderers


   .. py:attribute:: operator_extra_links


   .. py:method:: execute(self, context)

      Execute the Apache Beam Pipeline.


   .. py:method:: on_kill(self)

      Override this method to clean up subprocesses when a task instance
      gets killed. Any use of the threading, subprocess or multiprocessing
      module within an operator needs to be cleaned up, or it will leave
      ghost processes behind.


.. py:class:: BeamRunJavaPipelineOperator(*, jar, runner = 'DirectRunner', job_class = None, default_pipeline_options = None, pipeline_options = None, gcp_conn_id = 'google_cloud_default', delegate_to = None, dataflow_config = None, **kwargs)

   Bases: :py:obj:`BeamBasePipelineOperator`

   Launches Apache Beam pipelines written in Java.

   Note that both
   ``default_pipeline_options`` and ``pipeline_options`` will be merged to specify pipeline
   execution parameters, and ``default_pipeline_options`` is expected to hold
   high-level pipeline options, for instance project and zone information, which
   apply to all Apache Beam operators in the DAG.

   .. seealso::
      For more information on how to use this operator, take a look at the guide:
      :ref:`howto/operator:BeamRunJavaPipelineOperator`

   .. seealso::
      For more detail on Apache Beam have a look at the reference:
      https://beam.apache.org/documentation/

   You need to pass the path to your jar file as a file reference with the ``jar``
   parameter; the jar needs to be a self-executing jar (see the documentation here:
   https://beam.apache.org/documentation/runners/dataflow/#self-executing-jar).
   Use ``pipeline_options`` to pass pipeline options to your job.

   :param jar: The reference to a self-executing Apache Beam jar (templated).
   :param job_class: The name of the Apache Beam pipeline class to be executed; it
      is often not the main class configured in the pipeline jar file.
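
   A minimal usage sketch within a DAG context (as in the Python example above); the
   jar path and job class are illustrative assumptions, not values from this module.

   .. code-block:: python

      run_java_pipeline = BeamRunJavaPipelineOperator(
          task_id="run_java_pipeline",
          jar="/files/pipelines/wordcount-bundled.jar",  # hypothetical self-executing jar
          job_class="org.example.WordCount",  # hypothetical pipeline class
          runner="DirectRunner",
          pipeline_options={"output": "/files/output/results"},
      )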

   .. py:attribute:: template_fields
      :annotation: :Sequence[str] = ['jar', 'runner', 'job_class', 'pipeline_options', 'default_pipeline_options', 'dataflow_config']


   .. py:attribute:: template_fields_renderers


   .. py:attribute:: ui_color
      :annotation: = #0273d4


   .. py:attribute:: operator_extra_links


   .. py:method:: execute(self, context)

      Execute the Apache Beam Pipeline.


   .. py:method:: on_kill(self)

      Override this method to clean up subprocesses when a task instance
      gets killed. Any use of the threading, subprocess or multiprocessing
      module within an operator needs to be cleaned up, or it will leave
      ghost processes behind.


.. py:class:: BeamRunGoPipelineOperator(*, go_file, runner = 'DirectRunner', default_pipeline_options = None, pipeline_options = None, gcp_conn_id = 'google_cloud_default', delegate_to = None, dataflow_config = None, **kwargs)

   Bases: :py:obj:`BeamBasePipelineOperator`

   Launches Apache Beam pipelines written in Go. Note that both
   ``default_pipeline_options`` and ``pipeline_options`` will be merged to specify pipeline
   execution parameters, and ``default_pipeline_options`` is expected to hold
   high-level options, for instance project and zone information, which
   apply to all Beam operators in the DAG.

   .. seealso::
      For more information on how to use this operator, take a look at the guide:
      :ref:`howto/operator:BeamRunGoPipelineOperator`

   .. seealso::
      For more detail on Apache Beam have a look at the reference:
      https://beam.apache.org/documentation/

   :param go_file: Reference to the Go Apache Beam pipeline, e.g.,
      /some/local/file/path/to/your/go/pipeline/file.go
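
   A minimal usage sketch within a DAG context (as in the Python example above); the
   Go file path is an illustrative assumption, not a value from this module.

   .. code-block:: python

      run_go_pipeline = BeamRunGoPipelineOperator(
          task_id="run_go_pipeline",
          go_file="/files/pipelines/wordcount.go",  # hypothetical local Go pipeline file
          runner="DirectRunner",
          pipeline_options={"output": "/files/output/results"},
      )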

   .. py:attribute:: template_fields
      :annotation: = ['go_file', 'runner', 'pipeline_options', 'default_pipeline_options', 'dataflow_config']


   .. py:attribute:: template_fields_renderers


   .. py:attribute:: operator_extra_links


   .. py:method:: execute(self, context)

      Execute the Apache Beam Pipeline.


   .. py:method:: on_kill(self)

      Override this method to clean up subprocesses when a task instance
      gets killed. Any use of the threading, subprocess or multiprocessing
      module within an operator needs to be cleaned up, or it will leave
      ghost processes behind.