| :py:mod:`airflow.providers.apache.spark.operators.spark_jdbc` |
| ============================================================= |
| |
| .. py:module:: airflow.providers.apache.spark.operators.spark_jdbc |
| |
| |
| Module Contents |
| --------------- |
| |
| Classes |
| ~~~~~~~ |
| |
| .. autoapisummary:: |
| |
| airflow.providers.apache.spark.operators.spark_jdbc.SparkJDBCOperator |
| |
| |
| |
| |
| .. py:class:: SparkJDBCOperator(*, spark_app_name = 'airflow-spark-jdbc', spark_conn_id = 'spark-default', spark_conf = None, spark_py_files = None, spark_files = None, spark_jars = None, num_executors = None, executor_cores = None, executor_memory = None, driver_memory = None, verbose = False, principal = None, keytab = None, cmd_type = 'spark_to_jdbc', jdbc_table = None, jdbc_conn_id = 'jdbc-default', jdbc_driver = None, metastore_table = None, jdbc_truncate = False, save_mode = None, save_format = None, batch_size = None, fetch_size = None, num_partitions = None, partition_column = None, lower_bound = None, upper_bound = None, create_table_column_types = None, **kwargs) |
| |
| Bases: :py:obj:`airflow.providers.apache.spark.operators.spark_submit.SparkSubmitOperator` |
| |
| This operator extends the SparkSubmitOperator to perform data transfers to and |
| from JDBC-based databases with Apache Spark. As with the SparkSubmitOperator, it |
| assumes that the ``spark-submit`` binary is available on the PATH. |
| |
| .. seealso:: |
| For more information on how to use this operator, take a look at the guide: |
| :ref:`howto/operator:SparkJDBCOperator` |
| |
| :param spark_app_name: Name of the job (default airflow-spark-jdbc) |
| :param spark_conn_id: The :ref:`spark connection id <howto/connection:spark>` |
| as configured in Airflow administration |
| :param spark_conf: Any additional Spark configuration properties |
| :param spark_py_files: Additional Python files used (.zip, .egg, or .py) |
| :param spark_files: Additional files to upload to the container running the job |
| :param spark_jars: Additional jars to upload and add to the driver and |
| executor classpath |
| :param num_executors: Number of executors to run. This should be set to manage |
| the number of connections made with the JDBC database |
| :param executor_cores: Number of cores per executor |
| :param executor_memory: Memory per executor (e.g. 1000M, 2G) |
| :param driver_memory: Memory allocated to the driver (e.g. 1000M, 2G) |
| :param verbose: Whether to pass the verbose flag to spark-submit for debugging |
| :param keytab: Full path to the file that contains the keytab |
| :param principal: The name of the kerberos principal used for keytab |
| :param cmd_type: Which way the data should flow. Two possible values: |
| spark_to_jdbc: data written by Spark from the metastore to the JDBC database; |
| jdbc_to_spark: data written by Spark from the JDBC database to the metastore. |
| Usage sketches for both directions follow this parameter list |
| :param jdbc_table: The name of the JDBC table |
| :param jdbc_conn_id: Connection id used to connect to the JDBC database |
| :param jdbc_driver: Name of the JDBC driver to use for the JDBC connection. This |
| driver (usually a jar) should be passed in the 'spark_jars' parameter |
| :param metastore_table: The name of the metastore table |
| :param jdbc_truncate: (spark_to_jdbc only) Whether Spark should truncate, or drop |
| and recreate, the JDBC table. This only takes effect if |
| 'save_mode' is set to Overwrite. If the schemas differ, Spark |
| cannot truncate and will drop and recreate the table instead |
| :param save_mode: The Spark save-mode to use (e.g. overwrite, append, etc.) |
| :param save_format: (jdbc_to_spark only) The Spark save-format to use (e.g. parquet) |
| :param batch_size: (spark_to_jdbc only) The size of the batch to insert per round |
| trip to the JDBC database. Defaults to 1000 |
| :param fetch_size: (jdbc_to_spark only) The size of the batch to fetch per round trip |
| from the JDBC database. Default depends on the JDBC driver |
| :param num_partitions: The maximum number of partitions that can be used by Spark |
| simultaneously, both for spark_to_jdbc and jdbc_to_spark |
| operations. This will also cap the number of JDBC connections |
| that can be opened |
| :param partition_column: (jdbc_to_spark only) A numeric column used to partition |
| the reads from the JDBC table. If specified, you must |
| also specify num_partitions, lower_bound, and |
| upper_bound (see the partitioned-read sketch below) |
| :param lower_bound: (jdbc_to_spark only) Lower bound of the range of the numeric |
| partition column to fetch. If specified, you must also specify |
| num_partitions, partition_column, and upper_bound |
| :param upper_bound: (jdbc_to_spark only) Upper bound of the range of the numeric |
| partition column to fetch. If specified, you must also specify |
| num_partitions, partition_column, and lower_bound |
| :param create_table_column_types: (spark_to_jdbc only) The database column data |
| types to use instead of the defaults when creating |
| the table. Data type information should be specified |
| in the same format as CREATE TABLE columns syntax |
| (e.g. "name CHAR(64), comments VARCHAR(1024)"). |
| The specified types should be valid Spark SQL data |
| types. |
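| |
| A minimal ``spark_to_jdbc`` sketch is shown below. The DAG, connection ids, table |
| names, and jar path are illustrative assumptions, not values shipped with this |
| module. |
| |
| .. code-block:: python |
| |
|     from datetime import datetime |
| |
|     from airflow import DAG |
|     from airflow.providers.apache.spark.operators.spark_jdbc import SparkJDBCOperator |
| |
|     with DAG( |
|         dag_id="example_spark_jdbc",  # hypothetical DAG id |
|         start_date=datetime(2022, 1, 1), |
|         schedule_interval=None, |
|     ) as dag: |
|         # Write a metastore table out to a JDBC database. |
|         metastore_to_jdbc = SparkJDBCOperator( |
|             task_id="metastore_to_jdbc", |
|             cmd_type="spark_to_jdbc", |
|             spark_conn_id="spark-default", |
|             jdbc_conn_id="jdbc-default", |
|             jdbc_table="public.orders",  # placeholder target table |
|             metastore_table="default.orders",  # placeholder source table |
|             jdbc_driver="org.postgresql.Driver", |
|             spark_jars="/path/to/postgresql.jar",  # jar providing the driver |
|             save_mode="overwrite", |
|             jdbc_truncate=True,  # truncate rather than drop/recreate on overwrite |
|             batch_size=5000, |
|         ) |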
| |
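| For the reverse direction, a ``jdbc_to_spark`` sketch with partitioned reads; the |
| column and bounds are likewise placeholders. ``partition_column``, ``lower_bound``, |
| ``upper_bound``, and ``num_partitions`` must be set together, and ``num_partitions`` |
| also caps the number of concurrent JDBC connections. |
| |
| .. code-block:: python |
| |
|     # Inside the same DAG context as the previous sketch. |
|     jdbc_to_metastore = SparkJDBCOperator( |
|         task_id="jdbc_to_metastore", |
|         cmd_type="jdbc_to_spark", |
|         spark_conn_id="spark-default", |
|         jdbc_conn_id="jdbc-default", |
|         jdbc_table="public.orders",  # placeholder source table |
|         metastore_table="default.orders_copy",  # placeholder target table |
|         jdbc_driver="org.postgresql.Driver", |
|         spark_jars="/path/to/postgresql.jar", |
|         save_mode="overwrite", |
|         save_format="parquet", |
|         num_partitions=8,  # parallel reads; also caps JDBC connections |
|         partition_column="order_id",  # numeric column, assumed to exist |
|         lower_bound="1", |
|         upper_bound="1000000", |
|         fetch_size=10000, |
|     ) |
| |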
| .. py:method:: execute(self, context) |
| |
| Call the SparkSubmitHook to run the provided Spark job. |
| |
| |
| .. py:method:: on_kill(self) |
| |
| Override this method to clean up subprocesses when a task instance |
| gets killed. Any use of the threading, subprocess, or multiprocessing |
| modules within an operator needs to be cleaned up, or it will leave |
| ghost processes behind. |
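| |
| As a general illustration of the cleanup contract described above (not the |
| implementation of this operator), a hypothetical operator that launches a |
| subprocess could clean it up like so: |
| |
| .. code-block:: python |
| |
|     import subprocess |
| |
|     from airflow.models import BaseOperator |
| |
| |
|     class ShellJobOperator(BaseOperator): |
|         """Hypothetical operator showing subprocess cleanup in on_kill.""" |
| |
|         def execute(self, context): |
|             # Keep a handle on the child so on_kill can reach it. |
|             self._proc = subprocess.Popen(["bash", "-c", "sleep 3600"]) |
|             self._proc.wait() |
| |
|         def on_kill(self): |
|             # Terminate the child so no ghost process outlives the task. |
|             if getattr(self, "_proc", None) and self._proc.poll() is None: |
|                 self._proc.kill() |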
| |
| |
| |