.. Licensed to the Apache Software Foundation (ASF) under one or more
contributor license agreements. See the NOTICE file distributed with
this work for additional information regarding copyright ownership.
The ASF licenses this file to You under the Apache License, Version 2.0
(the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at
.. http://www.apache.org/licenses/LICENSE-2.0
.. Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
`Hive`_
==========

`Apache Spark`_ has built-in support for accessing Hive tables. It works well in most cases,
but is limited to a single Hive Metastore. The Kyuubi Spark Hive Connector (KSHC) is a Hive connector
built on the Spark DataSource V2 API, which supports accessing multiple Hive Metastores in a single
Spark application.
Hive Integration
----------------

To enable the integration of the Kyuubi Spark SQL engine and the Hive connector through the
Spark DataSource V2 API, you need to:

- Reference the Hive connector :ref:`dependencies<kyuubi-hive-deps>`
- Set the Spark catalog :ref:`configurations<kyuubi-hive-conf>`
.. _kyuubi-hive-deps:
Dependencies
************

The **classpath** of the Kyuubi Spark SQL engine with Hive connector support consists of

1. kyuubi-spark-sql-engine-\ |release|\ _2.12.jar, the engine jar deployed with a Kyuubi distribution
2. a copy of the Spark distribution
3. kyuubi-spark-connector-hive_2.12-\ |release|\ .jar, which can be found in `Maven Central`_

To make the Hive connector packages visible on the runtime classpath of the engine, use one of
the following methods (see the sketch after this list):

1. Put the Kyuubi Hive connector packages into ``$SPARK_HOME/jars`` directly
2. Set ``spark.jars=/path/to/kyuubi-hive-connector``
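
A minimal sketch of the second method in ``spark-defaults.conf``; the jar path and version below
are illustrative and should point at the KSHC jar in your actual deployment:

.. code-block:: properties

   # illustrative path and version; use the KSHC jar you downloaded
   spark.jars /opt/spark/connectors/kyuubi-spark-connector-hive_2.12-1.10.0.jar
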
.. note::
   Starting from v1.9.2 and v1.10.0, the KSHC jars available in `Maven Central`_ guarantee binary
   compatibility across Spark versions, namely Spark 3.3 and later.

.. _kyuubi-hive-conf:
Configurations
**************

To activate the Kyuubi Spark Hive connector, set the following configurations:

.. code-block:: properties

   spark.sql.catalog.hive_catalog org.apache.kyuubi.spark.connector.hive.HiveTableCatalog
   spark.sql.catalog.hive_catalog.hive.metastore.uris thrift://metastore-host:port
   spark.sql.catalog.hive_catalog.<other.hive.conf> <value>
   spark.sql.catalog.hive_catalog.<other.hadoop.conf> <value>
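
Because each catalog carries its own ``hive.metastore.uris``, a single application can register
several catalogs backed by different Hive Metastores. A minimal sketch, assuming two metastore
endpoints (the catalog and host names below are illustrative):

.. code-block:: properties

   # catalog "hive_prod" backed by one metastore (9083 is the default HMS port)
   spark.sql.catalog.hive_prod org.apache.kyuubi.spark.connector.hive.HiveTableCatalog
   spark.sql.catalog.hive_prod.hive.metastore.uris thrift://metastore-prod:9083

   # catalog "hive_test" backed by another metastore
   spark.sql.catalog.hive_test org.apache.kyuubi.spark.connector.hive.HiveTableCatalog
   spark.sql.catalog.hive_test.hive.metastore.uris thrift://metastore-test:9083
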
Hive Connector Operations
-------------------------

Taking ``CREATE NAMESPACE`` as an example,

.. code-block:: sql

   CREATE NAMESPACE hive_catalog.ns;

Taking ``CREATE TABLE`` as an example,

.. code-block:: sql

   CREATE TABLE hive_catalog.ns.foo (
     id bigint COMMENT 'unique id',
     data string)
   USING parquet;

Taking ``SELECT`` as an example,

.. code-block:: sql

   SELECT * FROM hive_catalog.ns.foo;

Taking ``INSERT`` as an example,

.. code-block:: sql

   INSERT INTO hive_catalog.ns.foo VALUES (1, 'a'), (2, 'b'), (3, 'c');

Taking ``DROP TABLE`` as an example,

.. code-block:: sql

   DROP TABLE hive_catalog.ns.foo;

Taking ``DROP NAMESPACE`` as an example,

.. code-block:: sql

   DROP NAMESPACE hive_catalog.ns;
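
Since every table is addressed by a catalog-qualified name, tables that live in different Hive
Metastores can be combined in a single query. A minimal sketch, assuming the illustrative
``hive_prod`` and ``hive_test`` catalogs configured earlier, each holding a table ``ns.foo``:

.. code-block:: sql

   -- join tables from two different Hive Metastores in one statement
   SELECT p.id, t.data
   FROM hive_prod.ns.foo p
   JOIN hive_test.ns.foo t ON p.id = t.id;
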
Advanced Usages
***************

Though KSHC is a pure Spark DataSource V2 connector that is not coupled with the Kyuubi deployment,
due to implementation details inside ``spark-sql``, you should not expect KSHC to work properly with
``spark-sql``, and any issues caused by such a combination will not be considered at this time.
Instead, it is recommended to use BeeLine with Kyuubi as a drop-in replacement for ``spark-sql``,
or to switch to ``spark-shell``.

KSHC supports accessing a Kerberized Hive Metastore and HDFS by using a keytab, a TGT cache, or a
Delegation Token. It is not expected to work properly with multiple KDC instances; the limitation
comes from the JDK ``Krb5LoginModule``. For such cases, consider setting up cross-realm Kerberos
trust, so that you only need to talk to one KDC. A keytab-based sketch follows.
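
A minimal keytab-based sketch; these are the standard Spark Kerberos options rather than
KSHC-specific ones, and the principal and keytab path below are illustrative:

.. code-block:: properties

   # standard Spark options for logging in from a keytab
   # (principal and path are illustrative)
   spark.kerberos.principal user@EXAMPLE.COM
   spark.kerberos.keytab /path/to/user.keytab
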
For the HMS Thrift API used by Spark, it is known that the Hive 2.3.9 client is compatible with HMS
from 2.1 to 4.0, and the Hive 2.3.10 client is compatible with HMS from 1.1 to 4.0; such version
combinations should cover most cases. For other corner cases, KSHC also supports
``spark.sql.catalog.<catalog_name>.spark.sql.hive.metastore.jars`` and
``spark.sql.catalog.<catalog_name>.spark.sql.hive.metastore.version``, just as the Spark built-in
Hive datasource does; refer to the Spark documentation for details.
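
A minimal sketch of pinning an isolated metastore client for one catalog; the version below is
illustrative, and ``maven`` asks Spark to download the matching client jars from Maven repositories:

.. code-block:: properties

   # pin the metastore client version and jars for this catalog only
   # (the version is illustrative)
   spark.sql.catalog.hive_catalog.spark.sql.hive.metastore.version 2.3.10
   spark.sql.catalog.hive_catalog.spark.sql.hive.metastore.jars maven
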
Currently, KSHC has not implemented the Parquet/ORC read/write optimization for Hive tables; in other
words, it always uses Hive SerDe to access Hive tables, so there might be a performance gap compared
to the Spark built-in Hive datasource, especially due to the lack of support for vectorized reading.
You may also hit bugs caused by Hive SerDe, e.g. ``ParquetHiveSerDe`` cannot read Parquet files whose
decimals are written in the int-based format produced by the Spark Parquet datasource writer with
``spark.sql.parquet.writeLegacyFormat=false``.
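
If you control the writer, one possible workaround for the decimal case (an assumption based on the
documented behavior of this config, not a KSHC-specific recommendation) is to have the Spark Parquet
datasource write in the legacy, Hive-compatible layout:

.. code-block:: properties

   # make the Spark Parquet writer emit decimals in the Hive-readable legacy format
   spark.sql.parquet.writeLegacyFormat true
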
.. _Hive: https://hive.apache.org/
.. _Apache Spark: https://spark.apache.org/
.. _Maven Central: https://mvnrepository.com/artifact/org.apache.kyuubi/kyuubi-spark-connector-hive