| :py:mod:`airflow.providers.apache.spark.operators.spark_jdbc` |
| ============================================================= |
| |
| .. py:module:: airflow.providers.apache.spark.operators.spark_jdbc |
| |
| |
| Module Contents |
| --------------- |
| |
| Classes |
| ~~~~~~~ |
| |
| .. autoapisummary:: |
| |
| airflow.providers.apache.spark.operators.spark_jdbc.SparkJDBCOperator |
| |
| |
| |
| |
| .. py:class:: SparkJDBCOperator(*, spark_app_name = 'airflow-spark-jdbc', spark_conn_id = 'spark-default', spark_conf = None, spark_py_files = None, spark_files = None, spark_jars = None, num_executors = None, executor_cores = None, executor_memory = None, driver_memory = None, verbose = False, principal = None, keytab = None, cmd_type = 'spark_to_jdbc', jdbc_table = None, jdbc_conn_id = 'jdbc-default', jdbc_driver = None, metastore_table = None, jdbc_truncate = False, save_mode = None, save_format = None, batch_size = None, fetch_size = None, num_partitions = None, partition_column = None, lower_bound = None, upper_bound = None, create_table_column_types = None, **kwargs) |
| |
| Bases: :py:obj:`airflow.providers.apache.spark.operators.spark_submit.SparkSubmitOperator` |
| |
| This operator extends the SparkSubmitOperator to perform data transfers to and |
| from JDBC-based databases with Apache Spark. As with the SparkSubmitOperator, it |
| assumes that the ``spark-submit`` binary is available on the PATH. |
| |
| .. seealso:: |
| For more information on how to use this operator, take a look at the guide: |
| :ref:`howto/operator:SparkJDBCOperator` |
| |
| :param spark_app_name: Name of the job (default airflow-spark-jdbc) |
| :param spark_conn_id: The :ref:`spark connection id <howto/connection:spark>` |
| as configured in Airflow administration |
| :param spark_conf: Any additional Spark configuration properties |
| :param spark_py_files: Additional Python files used (.zip, .egg, or .py) |
| :param spark_files: Additional files to upload to the container running the job |
| :param spark_jars: Additional jars to upload and add to the driver and |
| executor classpath |
| :param num_executors: Number of executors to run. This should be set to manage |
| the number of connections made with the JDBC database |
| :param executor_cores: Number of cores per executor |
| :param executor_memory: Memory per executor (e.g. 1000M, 2G) |
| :param driver_memory: Memory allocated to the driver (e.g. 1000M, 2G) |
| :param verbose: Whether to pass the verbose flag to spark-submit for debugging |
| :param keytab: Full path to the file that contains the keytab |
| :param principal: The name of the kerberos principal used for keytab |
| :param cmd_type: Which way the data should flow. Two possible values: |
| spark_to_jdbc: data written by Spark from the metastore to the JDBC database; |
| jdbc_to_spark: data written by Spark from the JDBC database to the metastore. |
| Usage sketches for both directions follow this parameter list |
| :param jdbc_table: The name of the JDBC table |
| :param jdbc_conn_id: Connection id used to connect to the JDBC database |
| :param jdbc_driver: Name of the JDBC driver to use for the JDBC connection. This |
| driver (usually a jar) should be passed in the 'spark_jars' parameter |
| :param metastore_table: The name of the metastore table |
| :param jdbc_truncate: (spark_to_jdbc only) Whether Spark should truncate, or drop |
| and recreate, the JDBC table. This only takes effect if |
| 'save_mode' is set to Overwrite. If the schemas differ, Spark |
| cannot truncate and will drop and recreate the table instead |
| :param save_mode: The Spark save-mode to use (e.g. overwrite, append, etc.) |
| :param save_format: (jdbc_to_spark only) The Spark save-format to use (e.g. parquet) |
| :param batch_size: (spark_to_jdbc only) The size of the batch to insert per round |
| trip to the JDBC database. Defaults to 1000 |
| :param fetch_size: (jdbc_to_spark only) The size of the batch to fetch per round trip |
| from the JDBC database. Default depends on the JDBC driver |
| :param num_partitions: The maximum number of partitions that can be used by Spark |
| simultaneously, both for spark_to_jdbc and jdbc_to_spark |
| operations. This will also cap the number of JDBC connections |
| that can be opened |
| :param partition_column: (jdbc_to_spark only) A numeric column used to partition |
| the reads from the JDBC table. If specified, you must |
| also specify num_partitions, lower_bound, and |
| upper_bound (see the partitioned-read sketch below) |
| :param lower_bound: (jdbc_to_spark only) Lower bound of the range of the numeric |
| partition column to fetch. If specified, you must also specify |
| num_partitions, partition_column, and upper_bound |
| :param upper_bound: (jdbc_to_spark only) Upper bound of the range of the numeric |
| partition column to fetch. If specified, you must also specify |
| num_partitions, partition_column, and lower_bound |
| :param create_table_column_types: (spark_to_jdbc only) The database column data |
| types to use instead of the defaults when creating |
| the table. Data type information should be specified |
| in the same format as CREATE TABLE columns syntax |
| (e.g. "name CHAR(64), comments VARCHAR(1024)"). |
| The specified types should be valid Spark SQL data |
| types. |
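| |
| A minimal ``spark_to_jdbc`` sketch is shown below. The DAG, connection ids, table |
| names, and jar path are illustrative assumptions, not values shipped with this |
| module. |
| |
| .. code-block:: python |
| |
|     from datetime import datetime |
| |
|     from airflow import DAG |
|     from airflow.providers.apache.spark.operators.spark_jdbc import SparkJDBCOperator |
| |
|     with DAG( |
|         dag_id="example_spark_jdbc",  # hypothetical DAG id |
|         start_date=datetime(2022, 1, 1), |
|         schedule_interval=None, |
|     ) as dag: |
|         # Write a metastore table out to a JDBC database. |
|         metastore_to_jdbc = SparkJDBCOperator( |
|             task_id="metastore_to_jdbc", |
|             cmd_type="spark_to_jdbc", |
|             spark_conn_id="spark-default", |
|             jdbc_conn_id="jdbc-default", |
|             jdbc_table="public.orders",  # placeholder target table |
|             metastore_table="default.orders",  # placeholder source table |
|             jdbc_driver="org.postgresql.Driver", |
|             spark_jars="/path/to/postgresql.jar",  # jar providing the driver |
|             save_mode="overwrite", |
|             jdbc_truncate=True,  # truncate rather than drop/recreate on overwrite |
|             batch_size=5000, |
|         ) |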
| |
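| For the reverse direction, a ``jdbc_to_spark`` sketch with partitioned reads; the |
| column and bounds are likewise placeholders. ``partition_column``, ``lower_bound``, |
| ``upper_bound``, and ``num_partitions`` must be set together, and ``num_partitions`` |
| also caps the number of concurrent JDBC connections. |
| |
| .. code-block:: python |
| |
|     # Inside the same DAG context as the previous sketch. |
|     jdbc_to_metastore = SparkJDBCOperator( |
|         task_id="jdbc_to_metastore", |
|         cmd_type="jdbc_to_spark", |
|         spark_conn_id="spark-default", |
|         jdbc_conn_id="jdbc-default", |
|         jdbc_table="public.orders",  # placeholder source table |
|         metastore_table="default.orders_copy",  # placeholder target table |
|         jdbc_driver="org.postgresql.Driver", |
|         spark_jars="/path/to/postgresql.jar", |
|         save_mode="overwrite", |
|         save_format="parquet", |
|         num_partitions=8,  # parallel reads; also caps JDBC connections |
|         partition_column="order_id",  # numeric column, assumed to exist |
|         lower_bound="1", |
|         upper_bound="1000000", |
|         fetch_size=10000, |
|     ) |
| |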
| .. py:method:: execute(self, context) |
| |
| Call the SparkSubmitHook to run the provided Spark job. |
| |
| |
| .. py:method:: on_kill(self) |
| |
| Override this method to clean up subprocesses when a task instance |
| gets killed. Any use of the threading, subprocess, or multiprocessing |
| modules within an operator needs to be cleaned up, or it will leave |
| ghost processes behind. |
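| |
| As a general illustration of the cleanup contract described above (not the |
| implementation of this operator), a hypothetical operator that launches a |
| subprocess could clean it up like so: |
| |
| .. code-block:: python |
| |
|     import subprocess |
| |
|     from airflow.models import BaseOperator |
| |
| |
|     class ShellJobOperator(BaseOperator): |
|         """Hypothetical operator showing subprocess cleanup in on_kill.""" |
| |
|         def execute(self, context): |
|             # Keep a handle on the child so on_kill can reach it. |
|             self._proc = subprocess.Popen(["bash", "-c", "sleep 3600"]) |
|             self._proc.wait() |
| |
|         def on_kill(self): |
|             # Terminate the child so no ghost process outlives the task. |
|             if getattr(self, "_proc", None) and self._proc.poll() is None: |
|                 self._proc.kill() |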
| |
| |
| |