| <?xml version="1.0" encoding="UTF-8"?> |
| <!-- |
| Licensed to the Apache Software Foundation (ASF) under one |
| or more contributor license agreements. See the NOTICE file |
| distributed with this work for additional information |
| regarding copyright ownership. The ASF licenses this file |
| to you under the Apache License, Version 2.0 (the |
| "License"); you may not use this file except in compliance |
| with the License. You may obtain a copy of the License at |
| |
| http://www.apache.org/licenses/LICENSE-2.0 |
| |
| Unless required by applicable law or agreed to in writing, |
| software distributed under the License is distributed on an |
| "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY |
| KIND, either express or implied. See the License for the |
| specific language governing permissions and limitations |
| under the License. |
| --> |
| <!DOCTYPE concept PUBLIC "-//OASIS//DTD DITA Concept//EN" "concept.dtd"> |
| <concept id="refresh"> |
| |
| <title>REFRESH Statement</title> |
| <titlealts audience="PDF"><navtitle>REFRESH</navtitle></titlealts> |
| <prolog> |
| <metadata> |
| <data name="Category" value="Impala"/> |
| <data name="Category" value="SQL"/> |
| <data name="Category" value="DDL"/> |
| <data name="Category" value="Tables"/> |
| <data name="Category" value="Hive"/> |
| <data name="Category" value="Metastore"/> |
| <data name="Category" value="ETL"/> |
| <data name="Category" value="Ingest"/> |
| <data name="Category" value="Developers"/> |
| <data name="Category" value="Data Analysts"/> |
| </metadata> |
| </prolog> |
| |
| <conbody> |
| |
| <p> |
| <indexterm audience="hidden">REFRESH statement</indexterm> |
| To accurately respond to queries, the Impala node that acts as the coordinator (the node to which you are |
| connected through <cmdname>impala-shell</cmdname>, JDBC, or ODBC) must have current metadata about those |
| databases and tables that are referenced in Impala queries. If you are not familiar with the way Impala uses |
| metadata and how it shares the same metastore database as Hive, see |
| <xref href="impala_hadoop.xml#intro_metastore"/> for background information. |
| </p> |
| |
| <p conref="../shared/impala_common.xml#common/syntax_blurb"/> |
| |
| <codeblock rev="IMPALA-1683">REFRESH [<varname>db_name</varname>.]<varname>table_name</varname> [PARTITION (<varname>key_col1</varname>=<varname>val1</varname> [, <varname>key_col2</varname>=<varname>val2</varname>...])] |
| <ph rev="2.9.0 IMPALA-5259">REFRESH FUNCTIONS <varname>db_name</varname></ph> |
| </codeblock> |
| |
| <p conref="../shared/impala_common.xml#common/usage_notes_blurb"/> |
| |
| <p> |
| Use the <codeph>REFRESH</codeph> statement to load the latest metastore metadata and block location data for |
| a particular table in these scenarios: |
| </p> |
| |
| <ul> |
| <li> |
| After loading new data files into the HDFS data directory for the table. (Once you have set up an ETL |
| pipeline to bring data into Impala on a regular basis, this is typically the most frequent reason why |
| metadata needs to be refreshed.) |
| </li> |
| |
| <li> |
| After issuing <codeph>ALTER TABLE</codeph>, <codeph>INSERT</codeph>, <codeph>LOAD DATA</codeph>, or other |
| table-modifying SQL statement in Hive. |
| </li> |
| </ul> |
| |
| <note rev="2.3.0"> |
| <p rev="2.3.0"> |
| In <keyword keyref="impala23_full"/> and higher, the syntax <codeph>ALTER TABLE <varname>table_name</varname> RECOVER PARTITIONS</codeph> |
| is a faster alternative to <codeph>REFRESH</codeph> when the only change to the table data is the addition of |
| new partition directories through Hive or manual HDFS operations. |
| See <xref href="impala_alter_table.xml#alter_table"/> for details. |
| </p> |
| </note> |
| |
| <p> |
| You only need to issue the <codeph>REFRESH</codeph> statement on the node to which you connect to issue |
| queries. The coordinator node divides the work among all the Impala nodes in a cluster, and sends read |
| requests for the correct HDFS blocks without relying on the metadata on the other nodes. |
| </p> |
| |
| <p> |
| <codeph>REFRESH</codeph> reloads the metadata for the table from the metastore database, and does an |
| incremental reload of the low-level block location data to account for any new data files added to the HDFS |
| data directory for the table. It is a low-overhead, single-table operation, specifically tuned for the common |
| scenario where new data files are added to HDFS. |
| </p> |
| |
| <p> |
| Only the metadata for the specified table is flushed. The table must already exist and be known to Impala, |
| either because the <codeph>CREATE TABLE</codeph> statement was run in Impala rather than Hive, or because a |
| previous <codeph>INVALIDATE METADATA</codeph> statement caused Impala to reload its entire metadata catalog. |
| </p> |
| |
| <note> |
| <p rev="1.2"> |
| The catalog service broadcasts any changed metadata as a result of Impala |
| <codeph>ALTER TABLE</codeph>, <codeph>INSERT</codeph> and <codeph>LOAD DATA</codeph> statements to all |
| Impala nodes. Thus, the <codeph>REFRESH</codeph> statement is only required if you load data through Hive |
| or by manipulating data files in HDFS directly. See <xref href="impala_components.xml#intro_catalogd"/> for |
| more information on the catalog service. |
| </p> |
| <p rev="1.2.1"> |
| Another way to avoid inconsistency across nodes is to enable the |
| <codeph>SYNC_DDL</codeph> query option before performing a DDL statement or an <codeph>INSERT</codeph> or |
| <codeph>LOAD DATA</codeph>. |
| </p> |
| <p rev="1.1"> |
| The table name is a required parameter. To flush the metadata for all tables, use the |
| <codeph><xref href="impala_invalidate_metadata.xml#invalidate_metadata">INVALIDATE METADATA</xref></codeph> |
| command. |
| </p> |
| <p conref="../shared/impala_common.xml#common/invalidate_then_refresh"/> |
| </note> |
| |
| <p conref="../shared/impala_common.xml#common/refresh_vs_invalidate"/> |
| |
| <p> |
| A metadata update for an <codeph>impalad</codeph> instance <b>is</b> required if: |
| </p> |
| |
| <ul> |
| <li> |
| A metadata change occurs. |
| </li> |
| |
| <li> |
| <b>and</b> the change is made through Hive. |
| </li> |
| |
| <li> |
| <b>and</b> the change is made to a metastore database to which clients such as the Impala shell or ODBC directly |
| connect. |
| </li> |
| </ul> |
| |
| <p rev="1.2"> |
| A metadata update for an Impala node is <b>not</b> required after you run <codeph>ALTER TABLE</codeph>, |
| <codeph>INSERT</codeph>, or other table-modifying statement in Impala rather than Hive. Impala handles the |
| metadata synchronization automatically through the catalog service. |
| </p> |
| |
| <p> |
| Database and table metadata is typically modified by: |
| </p> |
| |
| <ul> |
| <li> |
| Hive - through <codeph>ALTER</codeph>, <codeph>CREATE</codeph>, <codeph>DROP</codeph> or |
| <codeph>INSERT</codeph> operations. |
| </li> |
| |
| <li> |
| Impalad - through <codeph>CREATE TABLE</codeph>, <codeph>ALTER TABLE</codeph>, and <codeph>INSERT</codeph> |
| operations. <ph rev="1.2">Such changes are propagated to all Impala nodes by the |
| Impala catalog service.</ph> |
| </li> |
| </ul> |
| |
| <p> |
| <codeph>REFRESH</codeph> causes the metadata for that table to be immediately reloaded. For a huge table, |
| that process could take a noticeable amount of time; but doing the refresh up front avoids an unpredictable |
| delay later, for example if the next reference to the table is during a benchmark test. |
| </p> |
| |
| <p rev="IMPALA-1683"> |
| <b>Refreshing a single partition:</b> |
| </p> |
| |
| <p rev="IMPALA-1683"> |
| In <keyword keyref="impala27_full"/> and higher, the <codeph>REFRESH</codeph> statement can apply to a single partition at a time, |
| rather than the whole table. Include the optional <codeph>PARTITION (<varname>partition_spec</varname>)</codeph> |
| clause and specify values for each of the partition key columns. |
| </p> |
| |
| <p rev="IMPALA-1683"> |
| The following examples show how to make Impala aware of data added to a single partition, after data is loaded into |
| a partition's data directory using some mechanism outside Impala, such as Hive or Spark. The partition can be one that |
| Impala created and is already aware of, or a new partition created through Hive. |
| </p> |
| |
| <codeblock rev="IMPALA-1683"><![CDATA[ |
| impala> create table p (x int) partitioned by (y int); |
| impala> insert into p (x,y) values (1,2), (2,2), (2,1); |
| impala> show partitions p; |
| +-------+-------+--------+------+... |
| | y | #Rows | #Files | Size |... |
| +-------+-------+--------+------+... |
| | 1 | -1 | 1 | 2B |... |
| | 2 | -1 | 1 | 4B |... |
| | Total | -1 | 2 | 6B |... |
| +-------+-------+--------+------+... |
| |
| -- ... Data is inserted into one of the partitions by some external mechanism ... |
| beeline> insert into p partition (y = 1) values(1000); |
| |
| impala> refresh p partition (y=1); |
| impala> select x from p where y=1; |
| +------+ |
| | x | |
| +------+ |
| | 2 | <- Original data created by Impala |
| | 1000 | <- Additional data inserted through Beeline |
| +------+ |
| ]]> |
| </codeblock> |
| |
| <p rev="IMPALA-1683"> |
| The same applies for tables with more than one partition key column. |
| The <codeph>PARTITION</codeph> clause of the <codeph>REFRESH</codeph> |
| statement must include all the partition key columns. |
| </p> |
| |
| <codeblock rev="IMPALA-1683"><![CDATA[ |
| impala> create table p2 (x int) partitioned by (y int, z int); |
| impala> insert into p2 (x,y,z) values (0,0,0), (1,2,3), (2,2,3); |
| impala> show partitions p2; |
| +-------+---+-------+--------+------+... |
| | y | z | #Rows | #Files | Size |... |
| +-------+---+-------+--------+------+... |
| | 0 | 0 | -1 | 1 | 2B |... |
| | 2 | 3 | -1 | 1 | 4B |... |
| | Total | | -1 | 2 | 6B |... |
| +-------+---+-------+--------+------+... |
| |
| -- ... Data is inserted into one of the partitions by some external mechanism ... |
| beeline> insert into p2 partition (y = 2, z = 3) values(1000); |
| |
| impala> refresh p2 partition (y=2, z=3); |
| impala> select x from p where y=2 and z = 3; |
| +------+ |
| | x | |
| +------+ |
| | 1 | <- Original data created by Impala |
| | 2 | <- Original data created by Impala |
| | 1000 | <- Additional data inserted through Beeline |
| +------+ |
| ]]> |
| </codeblock> |
| |
| <p rev="IMPALA-1683"> |
| The following examples show how specifying a nonexistent partition does not cause any error, |
| and the order of the partition key columns does not have to match the column order in the table. |
| The partition spec must include all the partition key columns; specifying an incomplete set of |
| columns does cause an error. |
| </p> |
| |
| <codeblock rev="IMPALA-1683"><![CDATA[ |
| -- Partition doesn't exist. |
| refresh p2 partition (y=0, z=3); |
| refresh p2 partition (y=0, z=-1) |
| -- Key columns specified in a different order than the table definition. |
| refresh p2 partition (z=1, y=0) |
| -- Incomplete partition spec causes an error. |
| refresh p2 partition (y=0) |
| ERROR: AnalysisException: Items in partition spec must exactly match the partition columns in the table definition: default.p2 (1 vs 2) |
| ]]> |
| </codeblock> |
| |
| <p conref="../shared/impala_common.xml#common/sync_ddl_blurb"/> |
| |
| <p conref="../shared/impala_common.xml#common/example_blurb"/> |
| |
| <p> |
| The following example shows how you might use the <codeph>REFRESH</codeph> statement after manually adding |
| new HDFS data files to the Impala data directory for a table: |
| </p> |
| |
| <codeblock>[impalad-host:21000] > refresh t1; |
| [impalad-host:21000] > refresh t2; |
| [impalad-host:21000] > select * from t1; |
| ... |
| [impalad-host:21000] > select * from t2; |
| ... </codeblock> |
| |
| <p> |
| For more examples of using <codeph>REFRESH</codeph> and <codeph>INVALIDATE METADATA</codeph> with a |
| combination of Impala and Hive operations, see <xref href="impala_tutorial.xml#tutorial_impala_hive"/>. |
| </p> |
| |
| <p> |
| <b>Related impala-shell options:</b> |
| </p> |
| |
| <p rev="1.1"> |
| The <cmdname>impala-shell</cmdname> option <codeph>-r</codeph> issues an <codeph>INVALIDATE METADATA</codeph> statement |
| when starting up the shell, effectively performing a <codeph>REFRESH</codeph> of all tables. |
| Due to the expense of reloading the metadata for all tables, the <cmdname>impala-shell</cmdname> <codeph>-r</codeph> |
| option is not recommended for day-to-day use in a production environment. (This option was mainly intended as a workaround |
| for synchronization issues in very old Impala versions.) |
| </p> |
| |
| <p conref="../shared/impala_common.xml#common/permissions_blurb"/> |
| <p rev=""> |
| The user ID that the <cmdname>impalad</cmdname> daemon runs under, |
| typically the <codeph>impala</codeph> user, must have execute |
| permissions for all the relevant directories holding table data. |
| (A table could have data spread across multiple directories, |
| or in unexpected paths, if it uses partitioning or |
| specifies a <codeph>LOCATION</codeph> attribute for |
| individual partitions or the entire table.) |
| Issues with permissions might not cause an immediate error for this statement, |
| but subsequent statements such as <codeph>SELECT</codeph> |
| or <codeph>SHOW TABLE STATS</codeph> could fail. |
| </p> |
| <p rev="IMPALA-1683"> |
| All HDFS and Sentry permissions and privileges are the same whether you refresh the entire table |
| or a single partition. |
| </p> |
| |
| <p conref="../shared/impala_common.xml#common/hdfs_blurb"/> |
| |
| <p> |
| The <codeph>REFRESH</codeph> command checks HDFS permissions of the underlying data files and directories, |
| caching this information so that a statement can be cancelled immediately if for example the |
| <codeph>impala</codeph> user does not have permission to write to the data directory for the table. Impala |
| reports any lack of write permissions as an <codeph>INFO</codeph> message in the log file, in case that |
| represents an oversight. If you change HDFS permissions to make data readable or writeable by the Impala |
| user, issue another <codeph>REFRESH</codeph> to make Impala aware of the change. |
| </p> |
| |
| <note conref="../shared/impala_common.xml#common/compute_stats_next"/> |
| |
| <p conref="../shared/impala_common.xml#common/s3_blurb"/> |
| <p conref="../shared/impala_common.xml#common/s3_metadata"/> |
| |
| <p conref="../shared/impala_common.xml#common/cancel_blurb_no"/> |
| |
| <p rev="kudu" conref="../shared/impala_common.xml#common/kudu_blurb"/> |
| <p conref="../shared/impala_common.xml#common/kudu_metadata_intro"/> |
| <p conref="../shared/impala_common.xml#common/kudu_metadata_details"/> |
| |
| <p conref="../shared/impala_common.xml#common/udf_blurb"/> |
| <p conref="../shared/impala_common.xml#common/refresh_functions_tip"/> |
| |
| <p conref="../shared/impala_common.xml#common/related_info"/> |
| <p> |
| <xref href="impala_hadoop.xml#intro_metastore"/>, |
| <xref href="impala_invalidate_metadata.xml#invalidate_metadata"/> |
| </p> |
| </conbody> |
| </concept> |