.. Licensed to the Apache Software Foundation (ASF) under one or more
contributor license agreements. See the NOTICE file distributed with
this work for additional information regarding copyright ownership.
The ASF licenses this file to You under the Apache License, Version 2.0
(the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at
.. http://www.apache.org/licenses/LICENSE-2.0
.. Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
`Hive`_
==========

`Apache Spark`_ has built-in support for accessing Hive tables. It works well in most cases,
but is limited to a single Hive Metastore. The Kyuubi Spark Hive Connector (KSHC) is a Hive connector
built on the Spark DataSource V2 API, which supports accessing multiple Hive Metastores in a single
Spark application.
Hive Integration
----------------

To enable the integration of the Kyuubi Spark SQL engine and the Hive connector through the
Spark DataSource V2 API, you need to:

- Reference the Hive connector :ref:`dependencies<kyuubi-hive-deps>`
- Set the Spark catalog :ref:`configurations<kyuubi-hive-conf>`
.. _kyuubi-hive-deps:
Dependencies
************

The **classpath** of the Kyuubi Spark SQL engine with Hive connector support consists of

1. kyuubi-spark-sql-engine-\ |release|\ _2.12.jar, the engine jar deployed with a Kyuubi distribution
2. a copy of the Spark distribution
3. kyuubi-spark-connector-hive_2.12-\ |release|\ .jar, which can be found in `Maven Central`_

To make the Hive connector packages visible on the runtime classpath of the engine, use one of
the following methods (see the sketch after this list):

1. Put the Kyuubi Hive connector packages into ``$SPARK_HOME/jars`` directly
2. Set ``spark.jars=/path/to/kyuubi-hive-connector``
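
A minimal sketch of the second method in ``spark-defaults.conf``; the jar path and version below
are illustrative and should point at the KSHC jar in your actual deployment:

.. code-block:: properties

   # illustrative path and version; use the KSHC jar you downloaded
   spark.jars /opt/spark/connectors/kyuubi-spark-connector-hive_2.12-1.10.0.jar
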
.. note::
   Starting from v1.9.2 and v1.10.0, the KSHC jars available in `Maven Central`_ guarantee binary
   compatibility across Spark versions, namely Spark 3.3 and later.

.. _kyuubi-hive-conf:
Configurations
**************

To activate the Kyuubi Spark Hive connector, set the following configurations:

.. code-block:: properties

   spark.sql.catalog.hive_catalog org.apache.kyuubi.spark.connector.hive.HiveTableCatalog
   spark.sql.catalog.hive_catalog.hive.metastore.uris thrift://metastore-host:port
   spark.sql.catalog.hive_catalog.<other.hive.conf> <value>
   spark.sql.catalog.hive_catalog.<other.hadoop.conf> <value>
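
Because each catalog carries its own ``hive.metastore.uris``, a single application can register
several catalogs backed by different Hive Metastores. A minimal sketch, assuming two metastore
endpoints (the catalog and host names below are illustrative):

.. code-block:: properties

   # catalog "hive_prod" backed by one metastore (9083 is the default HMS port)
   spark.sql.catalog.hive_prod org.apache.kyuubi.spark.connector.hive.HiveTableCatalog
   spark.sql.catalog.hive_prod.hive.metastore.uris thrift://metastore-prod:9083

   # catalog "hive_test" backed by another metastore
   spark.sql.catalog.hive_test org.apache.kyuubi.spark.connector.hive.HiveTableCatalog
   spark.sql.catalog.hive_test.hive.metastore.uris thrift://metastore-test:9083
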
Hive Connector Operations
-------------------------

Taking ``CREATE NAMESPACE`` as an example,

.. code-block:: sql

   CREATE NAMESPACE hive_catalog.ns;

Taking ``CREATE TABLE`` as an example,

.. code-block:: sql

   CREATE TABLE hive_catalog.ns.foo (
     id bigint COMMENT 'unique id',
     data string)
   USING parquet;

Taking ``SELECT`` as an example,

.. code-block:: sql

   SELECT * FROM hive_catalog.ns.foo;

Taking ``INSERT`` as an example,

.. code-block:: sql

   INSERT INTO hive_catalog.ns.foo VALUES (1, 'a'), (2, 'b'), (3, 'c');

Taking ``DROP TABLE`` as an example,

.. code-block:: sql

   DROP TABLE hive_catalog.ns.foo;

Taking ``DROP NAMESPACE`` as an example,

.. code-block:: sql

   DROP NAMESPACE hive_catalog.ns;
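
Since every table is addressed by a catalog-qualified name, tables that live in different Hive
Metastores can be combined in a single query. A minimal sketch, assuming the illustrative
``hive_prod`` and ``hive_test`` catalogs configured earlier, each holding a table ``ns.foo``:

.. code-block:: sql

   -- join tables from two different Hive Metastores in one statement
   SELECT p.id, t.data
   FROM hive_prod.ns.foo p
   JOIN hive_test.ns.foo t ON p.id = t.id;
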
Advanced Usages
***************

Though KSHC is a pure Spark DataSource V2 connector that is not coupled with the Kyuubi deployment,
due to implementation details inside ``spark-sql``, you should not expect KSHC to work properly with
``spark-sql``, and any issues caused by such a combination will not be considered at this time.
Instead, it is recommended to use BeeLine with Kyuubi as a drop-in replacement for ``spark-sql``,
or to switch to ``spark-shell``.

KSHC supports accessing a Kerberized Hive Metastore and HDFS by using a keytab, a TGT cache, or a
Delegation Token. It is not expected to work properly with multiple KDC instances; the limitation
comes from the JDK ``Krb5LoginModule``. For such cases, consider setting up cross-realm Kerberos
trust, so that you only need to talk to one KDC. A keytab-based sketch follows.
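
A minimal keytab-based sketch; these are the standard Spark Kerberos options rather than
KSHC-specific ones, and the principal and keytab path below are illustrative:

.. code-block:: properties

   # standard Spark options for logging in from a keytab
   # (principal and path are illustrative)
   spark.kerberos.principal user@EXAMPLE.COM
   spark.kerberos.keytab /path/to/user.keytab
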
For the HMS Thrift API used by Spark, it is known that the Hive 2.3.9 client is compatible with HMS
from 2.1 to 4.0, and the Hive 2.3.10 client is compatible with HMS from 1.1 to 4.0; such version
combinations should cover most cases. For other corner cases, KSHC also supports
``spark.sql.catalog.<catalog_name>.spark.sql.hive.metastore.jars`` and
``spark.sql.catalog.<catalog_name>.spark.sql.hive.metastore.version``, just as the Spark built-in
Hive datasource does; refer to the Spark documentation for details.
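
A minimal sketch of pinning an isolated metastore client for one catalog; the version below is
illustrative, and ``maven`` asks Spark to download the matching client jars from Maven repositories:

.. code-block:: properties

   # pin the metastore client version and jars for this catalog only
   # (the version is illustrative)
   spark.sql.catalog.hive_catalog.spark.sql.hive.metastore.version 2.3.10
   spark.sql.catalog.hive_catalog.spark.sql.hive.metastore.jars maven
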
Currently, KSHC has not implemented the Parquet/ORC read/write optimization for Hive tables; in other
words, it always uses Hive SerDe to access Hive tables, so there might be a performance gap compared
to the Spark built-in Hive datasource, especially due to the lack of support for vectorized reading.
You may also hit bugs caused by Hive SerDe, e.g. ``ParquetHiveSerDe`` cannot read Parquet files whose
decimals are written in the int-based format produced by the Spark Parquet datasource writer with
``spark.sql.parquet.writeLegacyFormat=false``.
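
If you control the writer, one possible workaround for the decimal case (an assumption based on the
documented behavior of this config, not a KSHC-specific recommendation) is to have the Spark Parquet
datasource write in the legacy, Hive-compatible layout:

.. code-block:: properties

   # make the Spark Parquet writer emit decimals in the Hive-readable legacy format
   spark.sql.parquet.writeLegacyFormat true
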
.. _Hive: https://hive.apache.org/
.. _Apache Spark: https://spark.apache.org/
.. _Maven Central: https://mvnrepository.com/artifact/org.apache.kyuubi/kyuubi-spark-connector-hive