:py:mod:`airflow.providers.yandex.operators.yandexcloud_dataproc`
=================================================================

.. py:module:: airflow.providers.yandex.operators.yandexcloud_dataproc


Module Contents
---------------

Classes
~~~~~~~

.. autoapisummary::

   airflow.providers.yandex.operators.yandexcloud_dataproc.DataprocCreateClusterOperator
   airflow.providers.yandex.operators.yandexcloud_dataproc.DataprocDeleteClusterOperator
   airflow.providers.yandex.operators.yandexcloud_dataproc.DataprocCreateHiveJobOperator
   airflow.providers.yandex.operators.yandexcloud_dataproc.DataprocCreateMapReduceJobOperator
   airflow.providers.yandex.operators.yandexcloud_dataproc.DataprocCreateSparkJobOperator
   airflow.providers.yandex.operators.yandexcloud_dataproc.DataprocCreatePysparkJobOperator



.. py:class:: DataprocCreateClusterOperator(*, folder_id = None, cluster_name = None, cluster_description = '', cluster_image_version = None, ssh_public_keys = None, subnet_id = None, services = ('HDFS', 'YARN', 'MAPREDUCE', 'HIVE', 'SPARK'), s3_bucket = None, zone = 'ru-central1-b', service_account_id = None, masternode_resource_preset = None, masternode_disk_size = None, masternode_disk_type = None, datanode_resource_preset = None, datanode_disk_size = None, datanode_disk_type = None, datanode_count = 1, computenode_resource_preset = None, computenode_disk_size = None, computenode_disk_type = None, computenode_count = 0, computenode_max_hosts_count = None, computenode_measurement_duration = None, computenode_warmup_duration = None, computenode_stabilization_duration = None, computenode_preemptible = False, computenode_cpu_utilization_target = None, computenode_decommission_timeout = None, connection_id = None, log_group_id = None, **kwargs)

   Bases: :py:obj:`airflow.models.BaseOperator`

   Creates a Yandex.Cloud Data Proc cluster.

   :param folder_id: ID of the folder in which the cluster should be created.
   :param cluster_name: Cluster name. Must be unique inside the folder.
   :param cluster_description: Cluster description.
   :param cluster_image_version: Cluster image version. If not specified, the default version is used.
   :param ssh_public_keys: List of SSH public keys that will be deployed to the created compute instances.
   :param subnet_id: ID of the subnetwork. All Data Proc cluster nodes will use one subnetwork.
   :param services: List of services that will be installed on the cluster. Possible options:
       HDFS, YARN, MAPREDUCE, HIVE, TEZ, ZOOKEEPER, HBASE, SQOOP, FLUME, SPARK, ZEPPELIN, OOZIE
   :param s3_bucket: Yandex.Cloud S3 bucket to store cluster logs.
       Jobs will not work if the bucket is not specified.
   :param zone: Availability zone to create the cluster in.
       Currently ru-central1-a, ru-central1-b, and ru-central1-c are available.
   :param service_account_id: Service account ID for the cluster.
       The service account can be created inside the folder.
   :param masternode_resource_preset: Resources preset (CPU+RAM configuration)
       for the primary node of the cluster.
   :param masternode_disk_size: Masternode storage size in GiB.
   :param masternode_disk_type: Masternode storage type. Possible options: network-ssd, network-hdd.
   :param datanode_resource_preset: Resources preset (CPU+RAM configuration)
       for the data nodes of the cluster.
   :param datanode_disk_size: Datanode storage size in GiB.
   :param datanode_disk_type: Datanode storage type. Possible options: network-ssd, network-hdd.
   :param datanode_count: Number of data nodes.
   :param computenode_resource_preset: Resources preset (CPU+RAM configuration)
       for the compute nodes of the cluster.
   :param computenode_disk_size: Computenode storage size in GiB.
   :param computenode_disk_type: Computenode storage type. Possible options: network-ssd, network-hdd.
   :param computenode_count: Number of compute nodes.
   :param computenode_max_hosts_count: Maximum number of nodes in the compute autoscaling subcluster.
   :param computenode_measurement_duration: Time in seconds allotted for averaging metrics.
   :param computenode_warmup_duration: Warmup time of the instance, in seconds. During this time,
       traffic is sent to the instance, but instance metrics are not collected.
   :param computenode_stabilization_duration: Minimum amount of time, in seconds, to monitor before
       Instance Groups can reduce the number of instances in the group.
       During this time, the group size doesn't decrease,
       even if the new metric values indicate that it should.
   :param computenode_preemptible: Preemptible instances are stopped at least once every 24 hours
       and can be stopped at any time if their resources are needed by Compute.
   :param computenode_cpu_utilization_target: Defines an autoscaling rule based on the average CPU
       utilization of the instance group, in percent (10-100).
       If not set, the default autoscaling strategy is used.
   :param computenode_decommission_timeout: Timeout, in seconds, to gracefully decommission nodes
       during downscaling.
   :param connection_id: ID of the Yandex.Cloud Airflow connection.
   :param log_group_id: ID of the log group to write logs to. By default, logs are sent to the
       default log group. To disable cloud log sending, set the cluster property
       dataproc:disable_cloud_logging = true.

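   A minimal usage sketch inside a DAG, assuming the resource IDs and bucket name below are
   replaced with real values from your folder (all IDs shown are hypothetical):

   .. code-block:: python

      from airflow.providers.yandex.operators.yandexcloud_dataproc import (
          DataprocCreateClusterOperator,
      )

      create_cluster = DataprocCreateClusterOperator(
          task_id="create_cluster",
          folder_id="b1gexample",  # hypothetical folder ID
          cluster_name="my-dataproc-cluster",
          subnet_id="e9bexample",  # hypothetical subnet ID
          s3_bucket="my-dataproc-logs",  # jobs will not work without a bucket
          zone="ru-central1-b",
          service_account_id="ajeexample",  # hypothetical service account ID
          computenode_count=1,
          connection_id="my_yandexcloud_connection",  # hypothetical connection ID
      )
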
   .. py:method:: execute(self, context)

      This is the main method to derive when creating an operator.
      Context is the same dictionary used when rendering Jinja templates.

      Refer to get_template_context for more context.



.. py:class:: DataprocDeleteClusterOperator(*, connection_id = None, cluster_id = None, **kwargs)

   Bases: :py:obj:`airflow.models.BaseOperator`

   Deletes a Yandex.Cloud Data Proc cluster.

   :param connection_id: ID of the Yandex.Cloud Airflow connection.
   :param cluster_id: ID of the cluster to remove. (templated)

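   A minimal usage sketch. Since ``cluster_id`` is templated, the example below pulls the ID
   over XCom from a hypothetical upstream ``create_cluster`` task; adjust this to however your
   DAG actually obtains the cluster ID:

   .. code-block:: python

      from airflow.providers.yandex.operators.yandexcloud_dataproc import (
          DataprocDeleteClusterOperator,
      )

      delete_cluster = DataprocDeleteClusterOperator(
          task_id="delete_cluster",
          # templated field: rendered with Jinja before execution
          cluster_id="{{ task_instance.xcom_pull('create_cluster') }}",
          connection_id="my_yandexcloud_connection",  # hypothetical connection ID
      )
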
   .. py:attribute:: template_fields
      :annotation: :Sequence[str] = ['cluster_id']



   .. py:method:: execute(self, context)

      This is the main method to derive when creating an operator.
      Context is the same dictionary used when rendering Jinja templates.

      Refer to get_template_context for more context.



.. py:class:: DataprocCreateHiveJobOperator(*, query = None, query_file_uri = None, script_variables = None, continue_on_failure = False, properties = None, name = 'Hive job', cluster_id = None, connection_id = None, **kwargs)

   Bases: :py:obj:`airflow.models.BaseOperator`

   Runs a Hive job in a Data Proc cluster.

   :param query: Hive query.
   :param query_file_uri: URI of the script that contains Hive queries. Can be placed in HDFS or S3.
   :param properties: A mapping of property names to values, used to configure Hive.
   :param script_variables: Mapping of query variable names to values.
   :param continue_on_failure: Whether to continue executing queries if a query fails.
   :param name: Name of the job. Used for labeling.
   :param cluster_id: ID of the cluster to run the job in.
       If not specified, the operator will try to take the ID from the Data Proc hook. (templated)
   :param connection_id: ID of the Yandex.Cloud Airflow connection.

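   A minimal usage sketch, omitting ``cluster_id`` on the assumption that the ID is supplied by
   the Data Proc hook after an earlier cluster-creation task in the same DAG; the script URI
   and variables are hypothetical:

   .. code-block:: python

      from airflow.providers.yandex.operators.yandexcloud_dataproc import (
          DataprocCreateHiveJobOperator,
      )

      # inline query
      hive_query = DataprocCreateHiveJobOperator(
          task_id="hive_query",
          query="SELECT 1;",
      )

      # query script stored in S3, with substitution variables
      hive_script = DataprocCreateHiveJobOperator(
          task_id="hive_script",
          query_file_uri="s3a://my-bucket/scripts/query.hql",
          script_variables={"CITIES_URI": "s3a://my-bucket/cities/"},
      )
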
   .. py:attribute:: template_fields
      :annotation: :Sequence[str] = ['cluster_id']



   .. py:method:: execute(self, context)

      This is the main method to derive when creating an operator.
      Context is the same dictionary used when rendering Jinja templates.

      Refer to get_template_context for more context.



.. py:class:: DataprocCreateMapReduceJobOperator(*, main_class = None, main_jar_file_uri = None, jar_file_uris = None, archive_uris = None, file_uris = None, args = None, properties = None, name = 'Mapreduce job', cluster_id = None, connection_id = None, **kwargs)

   Bases: :py:obj:`airflow.models.BaseOperator`

   Runs a MapReduce job in a Data Proc cluster.

   :param main_jar_file_uri: URI of the JAR file containing the job. Can be placed in HDFS or S3.
       Can be specified instead of main_class.
   :param main_class: Name of the main class of the job. Can be specified instead of main_jar_file_uri.
   :param file_uris: URIs of files used in the job. Can be placed in HDFS or S3.
   :param archive_uris: URIs of archive files used in the job. Can be placed in HDFS or S3.
   :param jar_file_uris: URIs of JAR files used in the job. Can be placed in HDFS or S3.
   :param properties: Properties for the job.
   :param args: Arguments to be passed to the job.
   :param name: Name of the job. Used for labeling.
   :param cluster_id: ID of the cluster to run the job in.
       If not specified, the operator will try to take the ID from the Data Proc hook. (templated)
   :param connection_id: ID of the Yandex.Cloud Airflow connection.

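   A minimal sketch of a Hadoop streaming job; the bucket, script locations, and property
   value are hypothetical, and ``cluster_id`` is again assumed to come from the Data Proc hook:

   .. code-block:: python

      from airflow.providers.yandex.operators.yandexcloud_dataproc import (
          DataprocCreateMapReduceJobOperator,
      )

      mapreduce_job = DataprocCreateMapReduceJobOperator(
          task_id="mapreduce_job",
          main_class="org.apache.hadoop.streaming.HadoopStreaming",
          file_uris=[
              "s3a://my-bucket/scripts/mapper.py",
              "s3a://my-bucket/scripts/reducer.py",
          ],
          args=[
              "-mapper", "mapper.py",
              "-reducer", "reducer.py",
              "-input", "s3a://my-bucket/input",
              "-output", "s3a://my-bucket/output",
          ],
          properties={"mapreduce.job.maps": "2"},  # hypothetical tuning property
      )
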
   .. py:attribute:: template_fields
      :annotation: :Sequence[str] = ['cluster_id']



   .. py:method:: execute(self, context)

      This is the main method to derive when creating an operator.
      Context is the same dictionary used when rendering Jinja templates.

      Refer to get_template_context for more context.



.. py:class:: DataprocCreateSparkJobOperator(*, main_class = None, main_jar_file_uri = None, jar_file_uris = None, archive_uris = None, file_uris = None, args = None, properties = None, name = 'Spark job', cluster_id = None, connection_id = None, packages = None, repositories = None, exclude_packages = None, **kwargs)

   Bases: :py:obj:`airflow.models.BaseOperator`

   Runs a Spark job in a Data Proc cluster.

   :param main_jar_file_uri: URI of the JAR file containing the job. Can be placed in HDFS or S3.
   :param main_class: Name of the main class of the job.
   :param file_uris: URIs of files used in the job. Can be placed in HDFS or S3.
   :param archive_uris: URIs of archive files used in the job. Can be placed in HDFS or S3.
   :param jar_file_uris: URIs of JAR files used in the job. Can be placed in HDFS or S3.
   :param properties: Properties for the job.
   :param args: Arguments to be passed to the job.
   :param name: Name of the job. Used for labeling.
   :param cluster_id: ID of the cluster to run the job in.
       If not specified, the operator will try to take the ID from the Data Proc hook. (templated)
   :param connection_id: ID of the Yandex.Cloud Airflow connection.
   :param packages: List of Maven coordinates of JARs to include on the driver and executor classpaths.
   :param repositories: List of additional remote repositories to search for the Maven coordinates
       given with --packages.
   :param exclude_packages: List of groupId:artifactId entries to exclude while resolving the
       dependencies provided in --packages, to avoid dependency conflicts.

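   A minimal usage sketch; the JAR location, class name, and Maven coordinate are hypothetical,
   and ``cluster_id`` is assumed to come from the Data Proc hook:

   .. code-block:: python

      from airflow.providers.yandex.operators.yandexcloud_dataproc import (
          DataprocCreateSparkJobOperator,
      )

      spark_job = DataprocCreateSparkJobOperator(
          task_id="spark_job",
          main_jar_file_uri="s3a://my-bucket/jobs/spark-job.jar",
          main_class="com.example.SparkJob",
          args=["--input", "s3a://my-bucket/input"],
          # extra dependencies, resolved like spark-submit --packages
          packages=["org.slf4j:slf4j-simple:1.7.30"],
      )
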
   .. py:attribute:: template_fields
      :annotation: :Sequence[str] = ['cluster_id']



   .. py:method:: execute(self, context)

      This is the main method to derive when creating an operator.
      Context is the same dictionary used when rendering Jinja templates.

      Refer to get_template_context for more context.



.. py:class:: DataprocCreatePysparkJobOperator(*, main_python_file_uri = None, python_file_uris = None, jar_file_uris = None, archive_uris = None, file_uris = None, args = None, properties = None, name = 'Pyspark job', cluster_id = None, connection_id = None, packages = None, repositories = None, exclude_packages = None, **kwargs)

   Bases: :py:obj:`airflow.models.BaseOperator`

   Runs a PySpark job in a Data Proc cluster.

   :param main_python_file_uri: URI of the Python file with the job. Can be placed in HDFS or S3.
   :param python_file_uris: URIs of Python files used in the job. Can be placed in HDFS or S3.
   :param file_uris: URIs of files used in the job. Can be placed in HDFS or S3.
   :param archive_uris: URIs of archive files used in the job. Can be placed in HDFS or S3.
   :param jar_file_uris: URIs of JAR files used in the job. Can be placed in HDFS or S3.
   :param properties: Properties for the job.
   :param args: Arguments to be passed to the job.
   :param name: Name of the job. Used for labeling.
   :param cluster_id: ID of the cluster to run the job in.
       If not specified, the operator will try to take the ID from the Data Proc hook. (templated)
   :param connection_id: ID of the Yandex.Cloud Airflow connection.
   :param packages: List of Maven coordinates of JARs to include on the driver and executor classpaths.
   :param repositories: List of additional remote repositories to search for the Maven coordinates
       given with --packages.
   :param exclude_packages: List of groupId:artifactId entries to exclude while resolving the
       dependencies provided in --packages, to avoid dependency conflicts.

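   A minimal usage sketch; the file locations are hypothetical, and ``cluster_id`` is assumed
   to come from the Data Proc hook:

   .. code-block:: python

      from airflow.providers.yandex.operators.yandexcloud_dataproc import (
          DataprocCreatePysparkJobOperator,
      )

      pyspark_job = DataprocCreatePysparkJobOperator(
          task_id="pyspark_job",
          main_python_file_uri="s3a://my-bucket/jobs/main.py",
          python_file_uris=["s3a://my-bucket/jobs/helpers.py"],
          args=["--input", "s3a://my-bucket/input"],
      )
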
   .. py:attribute:: template_fields
      :annotation: :Sequence[str] = ['cluster_id']



   .. py:method:: execute(self, context)

      This is the main method to derive when creating an operator.
      Context is the same dictionary used when rendering Jinja templates.

      Refer to get_template_context for more context.

