blob: 9ec8fe1ff1a92028c25258e759f40621fc5933c6 [file] [log] [blame]
// Licensed to the Apache Software Foundation (ASF) under one
// or more contributor license agreements. See the NOTICE file
// distributed with this work for additional information
// regarding copyright ownership. The ASF licenses this file
// to you under the Apache License, Version 2.0 (the
// "License"); you may not use this file except in compliance
// with the License. You may obtain a copy of the License at
//
// http://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing,
// software distributed under the License is distributed on an
// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
// KIND, either express or implied. See the License for the
// specific language governing permissions and limitations
// under the License.
= Using the Hive Metastore with Kudu
:author: Kudu Team
:imagesdir: ./images
:icons: font
:toc: left
:toclevels: 3
:doctype: book
:backend: html5
:sectlinks:
:experimental:
[[hive_metastore]]
## Overview
Kudu has an optional feature which allows it to integrate its own catalog with
the Hive Metastore (HMS). The HMS is the de-facto standard catalog and metadata
provider in the Hadoop ecosystem. When the HMS integration is enabled, Kudu
tables can be discovered and used by external HMS-aware tools, even if they are
not otherwise aware of or integrated with Kudu. Additionally, these components
can use the HMS to discover necessary information to connect to the Kudu
cluster which owns the table, such as the Kudu master addresses.
## Databases and Table Names
With the Hive Metastore integration disabled, Kudu presents tables as a single
flat namespace, with no hierarchy or concept of a database. Additionally,
Kudu's only restriction on table names is that they be a valid UTF-8 encoded
string. When the HMS integration is enabled in Kudu, both of these properties
change in order to match the HMS model: the table name must indicate the
table's membership of a Hive database, and table name identifiers (i.e. the
table name and database name) are subject to the Hive table name identifier
constraints.
### Databases
Hive has the concept of a database, which is a collection of individual tables.
Each database forms its own independent namespace of table names. In order to
fit into this model, Kudu tables must be assigned a database when the HMS
integration is enabled. No new APIs have been added to create or delete
databases, nor are there APIs to assign an existing Kudu table to a database.
Instead, a new convention has been introduced that Kudu table names must be in
the format `<hive-database-name>.<hive-table-name>`. Thus, databases are an
implicit part of the Kudu table name. By including databases as an implicit
part of the Kudu table name, existing applications that use Kudu tables can
operate on non-HMS-integrated and HMS-integrated table names with minimal or no
changes.
Kudu provides no additional tooling to create or drop Hive databases.
Administrators or users should use existing Hive tools such as the Beeline
Shell or Impala to do so.
### Naming Constraints
When the Hive Metastore integration is enabled, the database and table names of
Kudu tables must follow the Hive Metastore naming constraints. Namely, the
database and table name must contain only alphanumeric ASCII characters and
underscores (`_`).
NOTE: When the `hive.support.special.characters.tablename` Hive configuration
is `true`, the forward-slash (`/`) character in table name identifiers (i.e. the
table name and database name) is also supported.
Additionally, the Hive Metastore does not enforce case sensitivity for table
name identifiers. As such, when enabled, Kudu will follow suit and disallow
tables from being created when one already exists whose table name identifier
differs only by case. Operations that open, alter, or drop tables will also be
case-insensitive for the table name identifiers.
WARNING: Given the case insensitivity upon enabling the integration, if
multiple Kudu tables exist whose names only differ by case, the Kudu master(s)
will fail to start up. Be sure to rename such conflicting tables before
enabling the Hive Metastore integration.
[[metadata_sync]]
### Metadata Synchronization
When the Hive Metastore integration is enabled, Kudu will automatically
synchronize metadata changes to Kudu tables between Kudu and the HMS. As such,
it is important to always ensure that the Kudu and HMS have a consistent view
of existing tables, using the administrative tools described in the below
section. Failure to do so may result in issues like Kudu tables not being
discoverable or usable by external, HMS-aware components (e.g. Apache Sentry,
Apache Impala).
NOTE: the Hive Metastore automatically creates directories for Kudu tables.
These directories are benign and can safely be ignored.
Impala has notions of internal and external Kudu tables. When dropping an
internal table from Impala, the table's data is dropped in Kudu; in contrast
when dropping an external table, the table's data is not dropped in Kudu.
External tables may refer to tables by names that are different from the names
of the underlying Kudu tables, while internal tables must use the same names as
those stored in Kudu. Additionally, multiple external tables may refer to the
same underlying Kudu table. Thus, since external tables may not map one-to-one
with Kudu tables, the Hive Metastore integration and tooling will only
automatically synchronize metadata for internal tables. See the
<<kudu_impala_integration.adoc#kudu_impala,Kudu Impala integration documentation>>
for more information about table types in Impala
[[enabling-the-hive-metastore-integration]]
## Enabling the Hive Metastore Integration
WARNING: Before enabling the Hive Metastore integration on an existing cluster,
make sure to upgrade any tables that may exist in Kudu's or in the HMS's
catalog. See <<upgrading-tables>> for more details.
* When the Hive Metastore is configured with fine-grained authorization, the
Kudu admin needs to be able to access and modify directories that are created
for Kudu by the HMS. This can be done by adding the Kudu admin user to the group
of the Hive service users, e.g. by running `usermod -aG hive kudu` on the HMS
nodes.
* Configure the Hive Metastore to include the notification event listener and
the Kudu HMS plugin, to allow altering and dropping columns, and to add full
Thrift objects in notifications. Add the following values to the HMS
configuration in `hive-site.xml`:
```xml
<property>
<name>hive.metastore.transactional.event.listeners</name>
<value>
org.apache.hive.hcatalog.listener.DbNotificationListener,
org.apache.kudu.hive.metastore.KuduMetastorePlugin
</value>
</property>
<property>
<name>hive.metastore.disallow.incompatible.col.type.changes</name>
<value>false</value>
</property>
<property>
<name>hive.metastore.notifications.add.thrift.objects</name>
<value>true</value>
</property>
```
* After building Kudu from source, add the `hms-plugin.jar` found under the build
directory (e.g. `build/release/bin`) to the HMS classpath.
* Restart the HMS.
* Enable the Hive Metastore integration in Kudu with the following
configuration properties for the Kudu master(s):
```
--hive_metastore_uris=<HMS Thrift URI(s)>
--hive_metastore_sasl_enabled=<value of the Hive Metastore's hive.metastore.sasl.enabled configuration>
```
NOTE: In a secured cluster, in which `--hive_metastore_sasl_enabled` is set to
true, `--hive_metastore_kerberos_principal` must match the primary portion of
`hive.metastore.kerberos.principal` in the Hive Metastore configuration.
* Restart the Kudu master(s).
## Administrative Tools
Kudu provides the command line tools `kudu hms list`, `kudu hms precheck`,
`kudu hms check`, and `kudu hms fix` to allow administrators to find and fix
metadata inconsistencies between the internal Kudu catalog and the Hive
Metastore catalog, during the upgrade process described below or during the
normal operation of a Kudu cluster.
`kudu hms` tools should be run from the command line as the Kudu admin user.
They require the full list of master addresses to be specified:
[source,bash]
----
$ sudo -u kudu kudu hms check master-name-1:7051,master-name-2:7051,master-name-3:7051
----
To see a full list of the options available with the `kudu hms` tool, use the
`--help` flag.
NOTE: When fine-grained authorization is enabled, the Kudu admin user, commonly
"kudu", needs to have access to all the Kudu tables to be
able to run the `kudu hms` tools. This can be done by configuring the user as a
trusted user via the `--trusted_user_acl` master configuration. See
<<security.adoc#trusted-users,here>> for more information about trusted users.
NOTE: If the Hive Metastore is configured with fine-grained authorization, the
Kudu admin user needs to have read and write privileges on HMS table entries.
Configured this in the Hive Metastore. For Apache Sentry this can be configured
using the `sentry.metastore.service.users` property.
### `kudu hms list`
The `kudu hms list` tool scans the Hive Metastore catalog, and lists the HMS
entries (including table name and type) for Kudu tables, as indicated by their
HMS storage handler.
### `kudu hms precheck`
The `kudu hms precheck` tool scans the Kudu catalog and validates that if there
are multiple Kudu tables whose names only differ by case and logs the conflicted
table names.
### `kudu hms check`
The `kudu hms check` tool scans the Kudu and Hive Metastore catalogs, and
validates that the two catalogs agree on what Kudu tables exist. The tool will
make suggestions on how to fix any inconsistencies that are found. Typically,
the suggestion will be to run the `kudu hms fix` tool, however some certain
inconsistencies require using Impala Shell for fixing.
### `kudu hms fix`
The `kudu hms fix` tool analyzes the Kudu and HMS catalogs and attempts to fix
any automatically-fixable issues, for instance, by creating a table entry in
the HMS for each Kudu table that doesn't already have one. The `--dryrun` option
shows the proposed fix instead of actually executing it. When no automatic fix
is available, it will make suggestions on how a manual fix can help.
NOTE: The `kudu hms fix` tool will not automatically fix Impala external tables
for the reasons described above. It is instead recommended to fix issues with
external tables by dropping and recreating them.
### `kudu hms downgrade`
The `kudu hms downgrade` downgrades the metadata to legacy format for Kudu and
the Hive Metastores. It is discouraged to use unless necessary, since the legacy
format can be deprecated in future releases.
[[upgrading-tables]]
## Upgrading Existing Tables
Before enabling the Kudu-HMS integration, it is important to ensure that the
Kudu and HMS start with a consistent view of existing tables. This may entail
renaming Kudu tables to conform to the Hive naming constraints. This detailed
workflow describes how to upgrade existing tables before enabling the Hive
Metastore integration.
### Prepare for the Upgrade
. Establish a maintenance window. During this time the Kudu cluster will still be
available, but tables in Kudu and the Hive Metastore may be altered or
renamed as a part of the upgrade process.
. Make note of all external tables using the following command and drop them. This reduces
the chance of having naming conflicts with Kudu tables which can lead to errors during
upgrading process. It also helps in cases where a catalog upgrade breaks
external tables, due to the underlying Kudu tables being renamed. The
external tables can be recreated after upgrade is complete.
+
[source,bash]
----
$ sudo -u kudu kudu hms list master-name-1:7051,master-name-2:7051,master-name-3:7051
----
### Perform the Upgrade
. Run the `kudu hms precheck` tool to ensure no Kudu tables only differ by
case. If the tool does not report any warnings, you can skip the next step.
+
[source,bash]
----
$ sudo -u kudu kudu hms precheck master-name-1:7051,master-name-2:7051,master-name-3:7051
----
. If the `kudu hms precheck` tool reports conflicting tables, rename these to
case-insensitive unique names using the following command:
+
[source,bash]
----
$ sudo -u kudu kudu table rename_table master-name-1:7051,master-name-2:7051,master-name-3:7051 <conflicting_table_name> <new_table_name>
----
. Run the `kudu hms check` tool using the following command. If the tool does
not report any catalog inconsistencies, skip to Step 7 below.
+
[source,bash]
----
$ sudo -u kudu kudu hms check master-name-1:7051,master-name-2:7051,master-name-3:7051 --hive_metastore_uris=<hive_metastore_uris> [--ignore_other_clusters=<ignores_other_clusters>]
----
+
WARNING: By default, the `kudu hms` tools will ignore metadata in the HMS that
refer to a different Kudu cluster than that being operated on, as indicated by
having different masters specified. The tools compare the value of the
`kudu.master_addresses` table property (either supplied at table creation or as
`--kudu_master_hosts` on impalad daemons) in each HMS metadata entry against
the RPC endpoints (including the ports) of the Kudu masters. To have the
tooling account for and fix metadata entries with different master RPC
endpoints specified (e.g. if ports are not specified in the HMS), supply
`--ignore_other_clusters=false` as an argument to the `kud hms check` and `fix`
tools.
+
Example::
+
----
$ sudo -u kudu kudu hms check master-name-1:7051,master-name-2:7051,master-name-3:7051 --hive_metastore_uris=thrift://hive-metastore:9083 --ignore_other_clusters=false
----
+
. If the `kudu hms check` tool reports an inconsistent catalog, perform a
dry-run of the `kudu hms fix` tool to understand how the tool will attempt to
address the automatically-fixable issues.
+
[source,bash]
----
$ sudo -u kudu kudu hms fix master-name-1:7051,master-name-2:7051,master-name-3:7051 --hive_metastore_uris=<hive_metastore_uris> --dryrun=true [--ignore_other_clusters=<ignore_other_clusters>]
----
Example::
+
----
$ sudo -u kudu kudu hms check master-name-1:7051,master-name-2:7051,master-name-3:7051 --hive_metastore_uris=thrift://hive-metastore:9083 --dryrun=true --ignore_other_clusters=false
----
+
. Manually fix any issues that are reported by the check tool that cannot
be automatically fixed. For example, rename any tables with names that are not
Hive-conformant.
. Run `kudu hms fix` tool to automatically fix all the remaining issues.
+
[source,bash]
----
$ sudo -u kudu kudu hms fix master-name-1:7051,master-name-2:7051,master-name-3:7051 --hive_metastore_uris=<hive_metastore_uris> [--drop_orphan_hms_tables=<drops_orphan_hms_tables>] [--ignore_other_clusters=<ignore_other_clusters>]
----
+
Example::
+
----
$ sudo -u kudu kudu hms fix master-name-1:7051,master-name-2:7051,master-name-3:7051 --hive_metastore_uris=thrift://hive-metastore:9083 --ignore_other_clusters=false
----
+
NOTE: The `--drop_orphan_hms_tables` argument indicates whether to drop orphan
Hive Metastore tables that refer to non-existent Kudu tables. Due to
link:https://issues.apache.org/jira/browse/KUDU-2883[KUDU-2883] this option may
fail to drop HMS entries that have no table ID. A workaround to this is to drop
the table via Impala Shell.
. Recreate any external tables that were dropped when preparing for the upgrade
by using Impala Shell.
. Enable the Hive Metastore Integration as described
<<enabling-the-hive-metastore-integration>>.