| <?xml version="1.0" encoding="UTF-8"?> |
| <!-- |
| Licensed to the Apache Software Foundation (ASF) under one |
| or more contributor license agreements. See the NOTICE file |
| distributed with this work for additional information |
| regarding copyright ownership. The ASF licenses this file |
| to you under the Apache License, Version 2.0 (the |
| "License"); you may not use this file except in compliance |
| with the License. You may obtain a copy of the License at |
| |
| http://www.apache.org/licenses/LICENSE-2.0 |
| |
| Unless required by applicable law or agreed to in writing, |
| software distributed under the License is distributed on an |
| "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY |
| KIND, either express or implied. See the License for the |
| specific language governing permissions and limitations |
| under the License. |
| --> |
| <!DOCTYPE concept PUBLIC "-//OASIS//DTD DITA Concept//EN" "concept.dtd"> |
| <concept id="impala_metadata"> |
| |
| <title>Metadata Management</title> |
| |
| <prolog> |
| <metadata> |
| <data name="Category" value="Impala"/> |
| <data name="Category" value="Configuring"/> |
| <data name="Category" value="Administrators"/> |
| <data name="Category" value="Developers"/> |
| </metadata> |
| </prolog> |
| |
| <conbody> |
| |
| <p> |
| This topic describes various knobs you can use to control how Impala manages its metadata |
| in order to improve performance and scalability. |
| </p> |
| |
| <p outputclass="toc inpage"/> |
| |
| </conbody> |
| |
| <concept id="on_demand_metadata"> |
| |
| <title>On-demand Metadata</title> |
| |
| <conbody> |
| |
| <p> |
| In previous versions of Impala, every coordinator kept a replica of all the cache in |
| <codeph>catalogd</codeph>, consuming large memory on each coordinator with no option to |
| evict. Metadata always propagated through the <codeph>statestored</codeph> and suffers |
| from head-of-line blocking, for example, one user loading a big table blocking another |
| user loading a small table. |
| </p> |
| |
| <p> |
| With this new feature, the coordinators pull metadata as needed from |
| <codeph>catalogd</codeph> and cache it locally. The cached metadata gets evicted |
| automatically under memory pressure. |
| </p> |
| |
| <p> |
| The granularity of on-demand metadata fetches is now at the partition level between the |
| coordinator and <codeph>catalogd</codeph>. Common use cases like add/drop partitions do |
| not trigger unnecessary serialization/deserialization of large metadata. |
| </p> |
| |
| <p> |
| This feature is disabled by default. |
| </p> |
| |
| <p> |
| The feature can be used in either of the following modes. |
| <dl> |
| <dlentry> |
| |
| <dt> |
| Metadata on-demand mode |
| </dt> |
| |
| <dd> |
| In this mode, all coordinators use the metadata on-demand. |
| </dd> |
| |
| <dd> |
| Set the following on <codeph>catalogd</codeph>: |
| <codeblock>--catalog_topic_mode=minimal</codeblock> |
| </dd> |
| |
| <dd> |
| Set the following on all <codeph>impalad</codeph> coordinators: |
| <codeblock>--use_local_catalog=true</codeblock> |
| </dd> |
| |
| </dlentry> |
| |
| <dlentry> |
| |
| <dt> |
| Mixed mode |
| </dt> |
| |
| <dd> |
| In this mode, only some coordinators are enabled to use the metadata on-demand. |
| </dd> |
| |
| <dd> |
| We recommend that you use the mixed mode only for testing local catalog’s impact |
| on heap usage. |
| </dd> |
| |
| <dd> |
| Set the following on <codeph>catalogd</codeph>: |
| <codeblock>--catalog_topic_mode=mixed</codeblock> |
| </dd> |
| |
| <dd> |
| Set the following on <codeph>impalad</codeph> coordinators with metdadata |
| on-demand: |
| <codeblock>--use_local_catalog=true </codeblock> |
| </dd> |
| |
| </dlentry> |
| <dlentry> |
| <dt>Flags related to <codeph>use_local_catalog</codeph></dt> |
| <dd>When <codeph>use_local_catalog</codeph> is enabled or set to <codeph>True</codeph> on the impalad |
| coordinators the following list of flags configure various parameters as described below. It is not |
| recommended to change the default values on these flags. |
| </dd> |
| <dd> |
| <ul> |
| <li>The flag <codeph>local_catalog_cache_mb</codeph> (defaults to -1) configures the |
| size of the catalog cache within each coordinator. With the default set to -1, the |
| cache is auto-configured to 60% of the configured Java heap size. Note that the |
| Java heap size is distinct from and typically smaller than the overall Impala |
| memory limit.</li> |
| <li>The flag <codeph>local_catalog_cache_expiration_s</codeph> (defaults to 3600) configures the |
| expiration time of the catalog cache within each impalad. Even if the configured |
| cache capacity has not been reached, items are removed from the cache if they have not |
| been accessed in the defined amount of time.</li> |
| <li>The flag <codeph>local_catalog_max_fetch_retries</codeph> (defaults to 40) configures |
| the maximum number of retries needed for queries to fetch a metadata object from the impalad |
| coordinator's local catalog cache.</li> |
| </ul> |
| </dd> |
| </dlentry> |
| </dl> |
| </p> |
| |
| </conbody> |
| |
| </concept> |
| |
| <concept id="auto_invalidate_metadata"> |
| |
| <title>Automatic Invalidation of Metadata Cache</title> |
| |
| <conbody> |
| |
| <p> |
| To keep the size of metadata bounded, <codeph>catalogd</codeph> periodically scans all |
| the tables and invalidates those not recently used. There are two types of |
| configurations for <codeph>catalogd</codeph> and <codeph>impalad</codeph>. |
| </p> |
| |
| <dl> |
| <dlentry> |
| |
| <dt> |
| Time-based cache invalidation |
| </dt> |
| |
| <dd> |
| <codeph>Catalogd</codeph> invalidates tables that are not recently used in the |
| specified time period (in seconds). |
| </dd> |
| |
| <dd> |
| The <codeph>‑‑invalidate_tables_timeout_s</codeph> flag needs to be |
| applied to both <codeph>impalad</codeph> and <codeph>catalogd</codeph>. |
| </dd> |
| |
| </dlentry> |
| |
| <dlentry> |
| |
| <dt> |
| Memory-based cache invalidation |
| </dt> |
| |
| <dd> |
| When the memory pressure reaches 60% of JVM heap size after a Java garbage |
| collection in <codeph>catalogd</codeph>, Impala invalidates 10% of the least |
| recently used tables. |
| </dd> |
| |
| <dd> |
| The <codeph>‑‑invalidate_tables_on_memory_pressure</codeph> flag needs |
| to be applied to both <codeph>impalad</codeph> and <codeph>catalogd</codeph>. |
| </dd> |
| |
| </dlentry> |
| </dl> |
| |
| <p> |
| Automatic invalidation of metadata provides more stability with lower chances of running |
| out of memory, but the feature could potentially cause performance issues and may |
| require tuning. |
| </p> |
| |
| </conbody> |
| |
| </concept> |
| |
| <concept id="auto_poll_hms_notification"> |
| |
| <title>Automatic Invalidation/Refresh of Metadata</title> |
| |
| <conbody> |
| |
| <p> |
| When tools such as Hive and Spark are used to process the raw data ingested into Hive |
| tables, new HMS metadata (database, tables, partitions) and filesystem metadata (new |
| files in existing partitions/tables) is generated. In previous versions of Impala, in |
| order to pick up this new information, Impala users needed to manually issue an |
| <codeph>INVALIDATE</codeph> or <codeph>REFRESH</codeph> commands. |
| </p> |
| |
| <p> |
| When automatic invalidate/refresh of metadata is enabled, <codeph>catalogd</codeph> |
| polls Hive Metastore (HMS) notification events at a configurable interval and processes |
| the following changes: |
| </p> |
| |
| <note> |
| This is a preview feature in <keyword keyref="impala33_full"/> and not generally |
| available. |
| </note> |
| |
| <ul> |
| <li> |
| Invalidates the tables when it receives the <codeph>ALTER TABLE</codeph> event. |
| </li> |
| |
| <li> |
| Refreshes the partition when it receives the <codeph>ALTER</codeph>, |
| <codeph>ADD</codeph>, or <codeph>DROP</codeph> partitions. |
| </li> |
| |
| <li> |
| Adds the tables or databases when it receives the <codeph>CREATE TABLE</codeph> or |
| <codeph>CREATE DATABASE</codeph> events. |
| </li> |
| |
| <li> |
| Removes the tables from <codeph>catalogd</codeph> when it receives the <codeph>DROP |
| TABLE</codeph> or <codeph>DROP DATABASE</codeph> events. |
| </li> |
| |
| <li> |
| Refreshes the table and partitions when it receives the <codeph>INSERT</codeph> |
| events. |
| <p> |
| If the table is not loaded at the time of processing the <codeph>INSERT</codeph> |
| event, the event processor does not need to refresh the table and skips it. |
| </p> |
| </li> |
| |
| <li> |
| Changes the database and updates <codeph>catalogd</codeph> when it receives the |
| <codeph>ALTER DATABASE</codeph> events. The following changes are supported. This |
| event does not invalidate the tables in the database. |
| <ul> |
| <li> |
| Change the database properties |
| </li> |
| |
| <li> |
| Change the comment on the database |
| </li> |
| |
| <li> |
| Change the owner of the database |
| </li> |
| |
| <li> |
| Change the default location of the database |
| <p> |
| Changing the default location of the database does not move the tables of that |
| database to the new location. Only the new tables which are created subsequently |
| use the default location of the database in case it is not provided in the |
| create table statement. |
| </p> |
| </li> |
| </ul> |
| </li> |
| </ul> |
| |
| <p> |
| This feature is controlled by the |
| <codeph>‑‑hms_event_polling_interval_s</codeph> flag. Start the |
| <codeph>catalogd</codeph> with the <codeph>‑‑hms_event_polling_interval_s</codeph> |
| flag set to a positive integer to enable the feature and set the polling frequency in |
| seconds. We recommend the value to be less than 5 seconds. |
| </p> |
| |
| <p> |
| The following use cases are not supported: |
| </p> |
| |
| <ul> |
| <li> |
| When you bypass HMS and add or remove data into table by adding files directly on the |
| filesystem, HMS does not generate the <codeph>INSERT</codeph> event, and the event |
| processor will not invalidate the corresponding table or refresh the corresponding |
| partition. |
| <p> |
| It is recommended that you use the <codeph>LOAD DATA</codeph> command to do the data |
| load in such cases, so that event processor can act on the events generated by the |
| <codeph>LOAD</codeph> command. |
| </p> |
| </li> |
| |
| <li> |
| The Spark API that saves data to a specified location does not generate events in HMS, |
| thus is not supported. For example: |
| <codeblock>Seq((1, 2)).toDF("i", "j").write.save("/user/hive/warehouse/spark_etl.db/customers/date=01012019")</codeblock> |
| </li> |
| </ul> |
| |
| <p> |
| This feature is turned off by default with the |
| <codeph>‑‑hms_event_polling_interval_s</codeph> flag set to |
| <codeph>0</codeph>. |
| </p> |
| |
| </conbody> |
| |
| <concept id="configure_event_based_metadata_sync"> |
| |
| <title>Configure HMS for Event Based Automatic Metadata Sync</title> |
| |
| <conbody> |
| |
| <p> |
| To use the HMS event based metadata sync: |
| </p> |
| |
| <ol> |
| <li> |
| Add the following entries to the <codeph>hive-site.xml</codeph> of the Hive |
| Metastore service. |
| <codeblock> <property> |
| <name>hive.metastore.transactional.event.listeners</name> |
| <value>org.apache.hive.hcatalog.listener.DbNotificationListener</value> |
| |
| <name>hive.metastore.dml.events</name> |
| <value>true</true> |
| </property></codeblock> |
| </li> |
| |
| <li> |
| Save <codeph>hive-site.xml</codeph>. |
| </li> |
| |
| <li> |
| Set the <codeph>hive.metastore.dml.events</codeph> configuration key to |
| <codeph>true</codeph> in HiveServer2 service's <codeph>hive-site.xml</codeph>. This |
| configuration key needs to be set to <codeph>true</codeph> in both Hive services, |
| HiveServer2 and Hive Metastore. |
| </li> |
| |
| <li> |
| If applicable, set the <codeph>hive.metastore.dml.events</codeph> configuration key |
| to <codeph>true</codeph> in <codeph>hive-site.xml</codeph> used by the Spark |
| applications (typically, <codeph>/etc/hive/conf/hive-site.xml</codeph>) so that the |
| <codeph>INSERT</codeph> events are generated when the Spark application inserts data |
| into existing tables and partitions. |
| </li> |
| |
| <li> |
| Restart the HiveServer2, Hive Metastore, and Spark (if applicable) services. |
| </li> |
| </ol> |
| |
| </conbody> |
| |
| </concept> |
| |
| <concept id="disable_event_based_metadata_sync"> |
| |
| <title>Disable Event Based Automatic Metadata Sync</title> |
| |
| <conbody> |
| |
| <p> |
| When the <codeph>‑‑hms_event_polling_interval_s</codeph> flag is set to a non-zero |
| value for your <codeph>catalogd</codeph>, the event-based automatic invalidation is |
| enabled for all databases and tables. If you wish to have the fine-grained control on |
| which tables or databases need to be synced using events, you can use the |
| <codeph>impala.disableHmsSync</codeph> property to disable the event processing at the |
| table or database level. |
| </p> |
| |
| <p> |
| When you add the <codeph>DBPROPERTIES</codeph> or <codeph>TBLPROPERTIES</codeph> with |
| the <codeph>impala.disableHmsSync</codeph> key, the HMS event based sync is turned on |
| or off. The value of the <codeph>impala.disableHmsSync</codeph> property determines if |
| the event processing needs to be disabled for a particular table or database. |
| </p> |
| |
| <ul> |
| <li> |
| If <codeph>'impala.disableHmsSync'='true'</codeph>, the events for that table or |
| database are ignored and not synced with HMS. |
| </li> |
| |
| <li> |
| If <codeph>'impala.disableHmsSync'='false'</codeph> or if |
| <codeph>impala.disableHmsSync</codeph> is not set, the automatic sync with HMS is |
| enabled if the <codeph>‑‑hms_event_polling_interval_s</codeph> global flag is |
| set to non-zero. |
| </li> |
| </ul> |
| |
| <ul> |
| <li> |
| To disable the event based HMS sync for a new database, set the |
| <codeph>impala.disableHmsSync</codeph> database properties in Hive as currently, |
| Impala does not support setting database properties: |
| <codeblock>CREATE DATABASE <name> WITH DBPROPERTIES ('impala.disableHmsSync'='true');</codeblock> |
| </li> |
| |
| <li> |
| To enable or disable the event based HMS sync for a table: |
| <codeblock>CREATE TABLE <name> WITH TBLPROPERTIES ('impala.disableHmsSync'='true' | 'false');</codeblock> |
| </li> |
| |
| <li> |
| To change the event based HMS sync at the table level: |
| <codeblock>ALTER TABLE <name> WITH TBLPROPERTIES ('impala.disableHmsSync'='true' | 'false');</codeblock> |
| </li> |
| </ul> |
| |
| <p> |
| When both table and database level properties are set, the table level property takes |
| precedence. If the table level property is not set, then the database level property |
| is used to evaluate if the event needs to be processed or not. |
| </p> |
| |
| <p> |
| If the property is changed from <codeph>true</codeph> (meaning events are skipped) to |
| <codeph>false</codeph> (meaning events are not skipped), you need to issue a manual |
| <codeph>INVALIDATE METADATA</codeph> command to reset event processor because it |
| doesn't know how many events have been skipped in the past and cannot know if the |
| object in the event is the latest. In such a case, the status of the event processor |
| changes to <codeph>NEEDS_INVALIDATE</codeph>. |
| </p> |
| |
| </conbody> |
| |
| </concept> |
| |
| <concept id="event_processor_metrics"> |
| |
| <title>Metrics for Event Based Automatic Metadata Sync</title> |
| |
| <conbody> |
| |
| <p> |
| You can use the web UI of the <codeph>catalogd</codeph> to check the state of the |
| automatic invalidate event processor. |
| </p> |
| |
| <p> |
| Under the web UI, there are two pages that presents the metrics for HMS event |
| processor that is responsible for the event based automatic metadata sync. |
| <ul> |
| <li> |
| <b>/metrics#events</b> |
| </li> |
| |
| <li> |
| <b>/events</b> |
| <p> |
| This provides a detailed view of the metrics of the event processor, including |
| min, max, mean, median, of the durations and rate metrics for all the counters |
| listed on the <b>/metrics#events</b> page. |
| </p> |
| </li> |
| </ul> |
| </p> |
| |
| </conbody> |
| |
| <concept id="concept_gch_xzm_1hb"> |
| |
| <title>/metrics#events Page</title> |
| |
| <conbody> |
| |
| <p> |
| The <b>/metrics#events</b> page provides the following metrics about the HMS event |
| processor. |
| </p> |
| |
| <table id="events-tbl"> |
| <tgroup cols="2"> |
| <colspec colnum="1" colname="col1" colwidth="1*"/> |
| <colspec colnum="2" colname="col3" colwidth="2.58*"/> |
| <thead> |
| <row> |
| <entry> |
| Name |
| </entry> |
| <entry> |
| Description |
| </entry> |
| </row> |
| </thead> |
| <tbody> |
| <row> |
| <entry> |
| events-processor.avg-events-fetch-duration |
| </entry> |
| <entry> |
| Average duration to fetch a batch of events and process it. |
| </entry> |
| </row> |
| <row> |
| <entry> |
| events-processor.avg-events-process-duration |
| </entry> |
| <entry> |
| Average time taken to process a batch of events received from the Metastore. |
| </entry> |
| </row> |
| <row> |
| <entry> |
| events-processor.events-received |
| </entry> |
| <entry> |
| Total number of the Metastore events received. |
| </entry> |
| </row> |
| <row> |
| <entry> |
| events-processor.events-received-15min-rate |
| </entry> |
| <entry> |
| Exponentially weighted moving average (EWMA) of number of events received in |
| last 15 min. |
| |
| <p> |
| This rate of events can be used to determine if there are spikes in event |
| processor activity during certain hours of the day. |
| </p> |
| </entry> |
| </row> |
| <row> |
| <entry> |
| events-processor.events-received-1min-rate |
| </entry> |
| <entry> |
| Exponentially weighted moving average (EWMA) of number of events received in |
| last 1 min. |
| |
| <p> |
| This rate of events can be used to determine if there are spikes in event |
| processor activity during certain hours of the day. |
| </p> |
| </entry> |
| </row> |
| <row> |
| <entry> |
| events-processor.events-received-5min-rate |
| </entry> |
| <entry> |
| Exponentially weighted moving average (EWMA) of number of events received in |
| last 5 min. |
| |
| <p> |
| This rate of events can be used to determine if there are spikes in event |
| processor activity during certain hours of the day. |
| </p> |
| </entry> |
| </row> |
| <row> |
| <entry> |
| events-processor.events-skipped |
| </entry> |
| <entry> |
| Total number of the Metastore events skipped. |
| |
| <p> |
| Events can be skipped based on certain flags are table and database level. |
| You can use this metric to make decisions, such as: |
| <ul> |
| <li> |
| If most of the events are being skipped, see if you might just turn |
| off the event processing. |
| </li> |
| |
| <li> |
| If most of the events are not skipped, see if you need to add flags on |
| certain databases. |
| </li> |
| </ul> |
| </p> |
| </entry> |
| </row> |
| <row> |
| <entry> |
| events-processor.status |
| </entry> |
| <entry> |
| Metastore event processor status to see if there are events being received |
| or not. Possible states are: |
| |
| <ul> |
| <li> |
| <codeph>PAUSED</codeph> |
| <p> |
| The event processor is paused because catalog is being reset |
| concurrently. |
| </p> |
| </li> |
| |
| <li> |
| <codeph>ACTIVE</codeph> |
| <p> |
| The event processor is scheduled at a given frequency. |
| </p> |
| </li> |
| |
| <li> |
| <codeph>ERROR</codeph> |
| </li> |
| |
| <li> |
| The event processor is in error state and event processing has stopped. |
| </li> |
| |
| <li> |
| <codeph>NEEDS_INVALIDATE</codeph> |
| <p> |
| The event processor could not resolve certain events and needs a |
| manual <codeph>INVALIDATE</codeph> command to reset the state. |
| </p> |
| </li> |
| |
| <li> |
| <codeph>STOPPED</codeph> |
| <p> |
| The event processing has been shutdown. No events will be processed. |
| </p> |
| </li> |
| |
| <li> |
| <codeph>DISABLED</codeph> |
| <p> |
| The event processor is not configured to run. |
| </p> |
| </li> |
| </ul> |
| </entry> |
| </row> |
| </tbody> |
| </tgroup> |
| </table> |
| |
| </conbody> |
| |
| </concept> |
| |
| </concept> |
| |
| </concept> |
| |
| </concept> |