:mod:`airflow.providers.apache.beam.operators.beam`
===================================================

.. py:module:: airflow.providers.apache.beam.operators.beam

.. autoapi-nested-parse::

   This module contains Apache Beam operators.


Module Contents
---------------
.. py:class:: BeamDataflowMixin

   Helper class to store common, Dataflow-specific logic for both
   :class:`~airflow.providers.apache.beam.operators.beam.BeamRunPythonPipelineOperator` and
   :class:`~airflow.providers.apache.beam.operators.beam.BeamRunJavaPipelineOperator`.

   .. attribute:: dataflow_hook
      :annotation: :Optional[DataflowHook]

   .. attribute:: dataflow_config
      :annotation: :Optional[DataflowConfiguration]

   .. method:: _set_dataflow(self, pipeline_options: dict, job_name_variable_key: Optional[str] = None)

   .. method:: __set_dataflow_hook(self)

   .. method:: __get_dataflow_job_name(self)

   .. method:: __get_dataflow_pipeline_options(self, pipeline_options: dict, job_name: str, job_name_key: Optional[str] = None)

   .. method:: __get_dataflow_process_callback(self)

.. py:class:: BeamRunPythonPipelineOperator(*, py_file: str, runner: str = 'DirectRunner', default_pipeline_options: Optional[dict] = None, pipeline_options: Optional[dict] = None, py_interpreter: str = 'python3', py_options: Optional[List[str]] = None, py_requirements: Optional[List[str]] = None, py_system_site_packages: bool = False, gcp_conn_id: str = 'google_cloud_default', delegate_to: Optional[str] = None, dataflow_config: Optional[Union[DataflowConfiguration, dict]] = None, **kwargs)

   Bases: :class:`airflow.models.BaseOperator`, :class:`airflow.providers.apache.beam.operators.beam.BeamDataflowMixin`
   Launch Apache Beam pipelines written in Python. Note that
   ``default_pipeline_options`` and ``pipeline_options`` are merged to build the
   pipeline execution parameters; ``default_pipeline_options`` is expected to hold
   high-level options, for instance project and zone information, which
   apply to all Beam operators in the DAG.

   .. seealso::
      For more information on how to use this operator, take a look at the guide:
      :ref:`howto/operator:BeamRunPythonPipelineOperator`

   .. seealso::
      For more detail on Apache Beam have a look at the reference:
      https://beam.apache.org/documentation/

   :param py_file: Reference to the Python Apache Beam pipeline file (.py), e.g.,
       /some/local/file/path/to/your/python/pipeline/file. (templated)
   :type py_file: str
   :param runner: Runner on which the pipeline will be run. By default "DirectRunner" is used.
       Other possible options: DataflowRunner, SparkRunner, FlinkRunner.
       See: :class:`~providers.apache.beam.hooks.beam.BeamRunnerType`
       See: https://beam.apache.org/documentation/runners/capability-matrix/
   :type runner: str
   :param py_options: Additional Python options, e.g., ["-m", "-v"].
   :type py_options: list[str]
   :param default_pipeline_options: Map of default pipeline options.
   :type default_pipeline_options: dict
   :param pipeline_options: Map of pipeline options. The keys are option names; the
       values can be of different types:

       * If the value is None, the single option ``--key`` (without a value) will be added.
       * If the value is False, the option will be skipped.
       * If the value is True, the single option ``--key`` (without a value) will be added.
       * If the value is a list, the option will be repeated for each element.
         For the key ``key`` and the value ``['A', 'B']``, the options ``--key=A --key=B``
         will be added.
       * Other value types will be replaced with their Python textual representation.

       When defining labels (``labels`` option), you can also provide a dictionary.
   :type pipeline_options: dict
   :param py_interpreter: Python interpreter used to run the Beam pipeline.
       Defaults to ``python3``.
       To track Python versions supported by Beam and related
       issues check: https://issues.apache.org/jira/browse/BEAM-1251
   :type py_interpreter: str
   :param py_requirements: Additional Python package(s) to install.
       If a value is passed to this parameter, a new virtual environment will be created
       with the additional packages installed.

       You can also install the ``apache-beam`` package if it is not installed on your
       system, or if you want to use a different version.
   :type py_requirements: List[str]
   :param py_system_site_packages: Whether to include system_site_packages in your virtualenv.
       See the virtualenv documentation for more information.

       This option is only relevant if the ``py_requirements`` parameter is not None.
   :type py_system_site_packages: bool
   :param gcp_conn_id: Optional.
       The connection ID to use when connecting to Google Cloud Storage if the Python file is on GCS.
   :type gcp_conn_id: str
   :param delegate_to: Optional.
       The account to impersonate using domain-wide delegation of authority,
       if any. For this to work, the service account making the request must have
       domain-wide delegation enabled.
   :type delegate_to: str
   :param dataflow_config: Dataflow configuration, used when the runner type is set to DataflowRunner.
   :type dataflow_config: Union[dict, providers.google.cloud.operators.dataflow.DataflowConfiguration]

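The option-to-flag conversion rules listed under ``pipeline_options`` can be sketched as plain Python (a simplified illustration of the documented behaviour, not the provider's actual implementation; the function name is hypothetical):

```python
def options_to_args(pipeline_options: dict) -> list:
    """Turn an options dict into command-line flags per the rules above (sketch)."""
    args = []
    for key, value in pipeline_options.items():
        if value is False:
            continue  # False: the option is skipped entirely
        if value is None or value is True:
            args.append(f"--{key}")  # None/True: bare flag without a value
        elif isinstance(value, list):
            # list: the option is repeated once per element
            args.extend(f"--{key}={item}" for item in value)
        else:
            # other types: the textual representation of the value
            args.append(f"--{key}={value}")
    return args
```

For example, ``{"labels": ["a=1", "b=2"], "streaming": True, "off": False}`` would yield ``--labels=a=1 --labels=b=2 --streaming``.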
   .. attribute:: template_fields
      :annotation: = ['py_file', 'runner', 'pipeline_options', 'default_pipeline_options', 'dataflow_config']

   .. attribute:: template_fields_renderers

   .. method:: execute(self, context)

      Execute the Apache Beam Pipeline.

   .. method:: on_kill(self)

.. py:class:: BeamRunJavaPipelineOperator(*, jar: str, runner: str = 'DirectRunner', job_class: Optional[str] = None, default_pipeline_options: Optional[dict] = None, pipeline_options: Optional[dict] = None, gcp_conn_id: str = 'google_cloud_default', delegate_to: Optional[str] = None, dataflow_config: Optional[Union[DataflowConfiguration, dict]] = None, **kwargs)

   Bases: :class:`airflow.models.BaseOperator`, :class:`airflow.providers.apache.beam.operators.beam.BeamDataflowMixin`
   Launch Apache Beam pipelines written in Java.

   Note that ``default_pipeline_options`` and ``pipeline_options`` are merged to build
   the pipeline execution parameters; ``default_pipeline_options`` is expected to hold
   high-level pipeline options, for instance project and zone information, which
   apply to all Apache Beam operators in the DAG.

   .. seealso::
      For more information on how to use this operator, take a look at the guide:
      :ref:`howto/operator:BeamRunJavaPipelineOperator`

   .. seealso::
      For more detail on Apache Beam have a look at the reference:
      https://beam.apache.org/documentation/

   You need to pass the path to your jar file as a file reference with the ``jar``
   parameter; the jar needs to be a self-executing jar (see the documentation here:
   https://beam.apache.org/documentation/runners/dataflow/#self-executing-jar).
   Use ``pipeline_options`` to pass pipeline options to your job.

   :param jar: The reference to a self-executing Apache Beam jar (templated).
   :type jar: str
   :param runner: Runner on which the pipeline will be run. By default "DirectRunner" is used.
       See:
       https://beam.apache.org/documentation/runners/capability-matrix/
   :type runner: str
   :param job_class: The name of the Apache Beam pipeline class to be executed; it
       is often not the main class configured in the pipeline jar file.
   :type job_class: str
   :param default_pipeline_options: Map of default job pipeline options.
   :type default_pipeline_options: dict
   :param pipeline_options: Map of job-specific pipeline options. The keys are option
       names; the values can be of different types:

       * If the value is None, the single option ``--key`` (without a value) will be added.
       * If the value is False, the option will be skipped.
       * If the value is True, the single option ``--key`` (without a value) will be added.
       * If the value is a list, the option will be repeated for each element.
         For the key ``key`` and the value ``['A', 'B']``, the options ``--key=A --key=B``
         will be added.
       * Other value types will be replaced with their Python textual representation.

       When defining labels (``labels`` option), you can also provide a dictionary.
   :type pipeline_options: dict
   :param gcp_conn_id: The connection ID to use when connecting to Google Cloud Storage if the jar is on GCS.
   :type gcp_conn_id: str
   :param delegate_to: The account to impersonate using domain-wide delegation of authority,
       if any. For this to work, the service account making the request must have
       domain-wide delegation enabled.
   :type delegate_to: str
   :param dataflow_config: Dataflow configuration, used when the runner type is set to DataflowRunner.
   :type dataflow_config: Union[dict, providers.google.cloud.operators.dataflow.DataflowConfiguration]

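As the class description notes, ``default_pipeline_options`` and ``pipeline_options`` are merged before execution, with job-specific values taking precedence over the defaults. A minimal sketch of that merge (the helper name is hypothetical; a plain dict update under the assumption that job-specific keys win):

```python
import copy


def merge_pipeline_options(default_pipeline_options, pipeline_options):
    """Merge DAG-wide defaults with job-specific options; job-specific keys win."""
    # Deep-copy the defaults so the shared DAG-level dict is never mutated.
    merged = copy.deepcopy(default_pipeline_options or {})
    merged.update(pipeline_options or {})
    return merged
```

For example, a DAG-wide ``{"project": "my-project", "zone": "us-central1-a"}`` combined with a per-task ``{"zone": "europe-west1-b"}`` produces options with the per-task zone.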
   .. attribute:: template_fields
      :annotation: = ['jar', 'runner', 'job_class', 'pipeline_options', 'default_pipeline_options', 'dataflow_config']

   .. attribute:: template_fields_renderers

   .. attribute:: ui_color
      :annotation: = #0273d4

   .. method:: execute(self, context)

      Execute the Apache Beam Pipeline.

   .. method:: on_kill(self)