| <?xml version="1.0" encoding="UTF-8"?> |
| <!-- |
| Licensed to the Apache Software Foundation (ASF) under one |
| or more contributor license agreements. See the NOTICE file |
| distributed with this work for additional information |
| regarding copyright ownership. The ASF licenses this file |
| to you under the Apache License, Version 2.0 (the |
| "License"); you may not use this file except in compliance |
| with the License. You may obtain a copy of the License at |
| |
| http://www.apache.org/licenses/LICENSE-2.0 |
| |
| Unless required by applicable law or agreed to in writing, |
| software distributed under the License is distributed on an |
| "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY |
| KIND, either express or implied. See the License for the |
| specific language governing permissions and limitations |
| under the License. |
| --> |
| <!DOCTYPE concept PUBLIC "-//OASIS//DTD DITA Concept//EN" "concept.dtd"> |
| <concept id="tables"> |
| |
| <title>Overview of Impala Tables</title> |
| <titlealts audience="PDF"><navtitle>Tables</navtitle></titlealts> |
| <prolog> |
| <metadata> |
| <data name="Category" value="Impala"/> |
| <data name="Category" value="Databases"/> |
| <data name="Category" value="SQL"/> |
| <data name="Category" value="Data Analysts"/> |
| <data name="Category" value="Developers"/> |
| <data name="Category" value="Querying"/> |
| <data name="Category" value="Tables"/> |
| <data name="Category" value="Schemas"/> |
| </metadata> |
| </prolog> |
| |
| <conbody> |
| |
| |
| <p> |
| Tables are the primary containers for data in Impala. They have the familiar row and column layout found in |
| other database systems, plus features such as partitioning that are often associated with higher-end data |
| warehouse systems. |
| </p> |
| |
| <p> |
| Logically, each table has a structure based on the definition of its columns, partitions, and other |
| properties. |
| </p> |
| |
| <p> |
| Physically, each table that uses HDFS storage is associated with a directory in HDFS. The table data consists of all the data files |
| underneath that directory: |
| </p> |
| |
| <ul> |
| <li> |
| <xref href="impala_tables.xml#internal_tables">Internal tables</xref> are managed by Impala, and use directories |
| inside the designated Impala work area. |
| </li> |
| |
| <li> |
| <xref href="impala_tables.xml#external_tables">External tables</xref> use arbitrary HDFS directories, where |
| the data files are typically shared between different Hadoop components. |
| </li> |
| |
| <li> |
| Large-scale data is usually handled by partitioned tables, where the data files are divided among different |
| HDFS subdirectories. |
| </li> |
| </ul> |
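| <p> |
| As a quick way to see the physical layout described above, you can check the <codeph>Location:</codeph> field |
| in the output of <codeph>DESCRIBE FORMATTED</codeph>, which shows the HDFS directory associated with a table. |
| (The table name here is a hypothetical illustration.) |
| </p> |
| |
| <codeblock> |
| -- Show the HDFS directory, table type, and other physical details for a table. |
| describe formatted sample_table; |
| </codeblock> |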
| |
| <p rev="2.2.0"> |
| Impala tables can also represent data that is stored in HBase, or in the Amazon S3 filesystem (<keyword keyref="impala22_full"/> or higher), |
| or on Isilon storage devices (<keyword keyref="impala223_full"/> or higher). See <xref href="impala_hbase.xml#impala_hbase"/>, |
| <xref href="impala_s3.xml#s3"/>, and <xref href="impala_isilon.xml#impala_isilon"/> |
| for details about those special kinds of tables. |
| </p> |
| |
| <p conref="../shared/impala_common.xml#common/ignore_file_extensions"/> |
| |
| <p outputclass="toc inpage"/> |
| |
| <p> |
| <b>Related statements:</b> <xref href="impala_create_table.xml#create_table"/>, |
| <xref href="impala_drop_table.xml#drop_table"/>, <xref href="impala_alter_table.xml#alter_table"/>, |
| <xref href="impala_insert.xml#insert"/>, <xref href="impala_load_data.xml#load_data"/>, |
| <xref href="impala_select.xml#select"/> |
| </p> |
| </conbody> |
| |
| <concept id="internal_tables"> |
| |
| <title>Internal Tables</title> |
| |
| <conbody> |
| |
| <p> |
| <indexterm audience="hidden">internal tables</indexterm> |
| The default kind of table produced by the <codeph>CREATE TABLE</codeph> statement is known as an internal |
| table. (Its counterpart is the external table, produced by the <codeph>CREATE EXTERNAL TABLE</codeph> |
| syntax.) |
| </p> |
| |
| <ul> |
| <li> |
| <p> |
| Impala creates a directory in HDFS to hold the data files. |
| </p> |
| </li> |
| |
| <li> |
| <p> |
| You can create data in internal tables by issuing <codeph>INSERT</codeph> or <codeph>LOAD DATA</codeph> |
| statements. |
| </p> |
| </li> |
| |
| <li> |
| <p> |
| If you add or replace data using HDFS operations, issue the <codeph>REFRESH</codeph> command in |
| <cmdname>impala-shell</cmdname> so that Impala recognizes the changes in data files, block locations, |
| and so on. |
| </p> |
| </li> |
| |
| <li> |
| <p> |
| When you issue a <codeph>DROP TABLE</codeph> statement, Impala physically removes all the data files |
| from the directory. |
| </p> |
| </li> |
| |
| <li> |
| <p conref="../shared/impala_common.xml#common/check_internal_external_table"/> |
| </li> |
| |
| <li> |
| <p> |
| When you issue an <codeph>ALTER TABLE</codeph> statement to rename an internal table, all data files |
| are moved into the new HDFS directory for the table. The files are moved even if they were formerly in |
| a directory outside the Impala data directory, for example in an internal table with a |
| <codeph>LOCATION</codeph> attribute pointing to an outside HDFS directory. |
| </p> |
| </li> |
| </ul> |
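| <p> |
| The points above can be sketched as a minimal internal table lifecycle. (The table and column names below are |
| hypothetical.) |
| </p> |
| |
| <codeblock> |
| -- Impala creates a directory in its designated HDFS work area for this table. |
| create table census (name string, census_year int); |
| |
| -- INSERT adds data files underneath that directory. |
| insert into census values ('Alice', 2010), ('Bob', 2020); |
| |
| -- After adding or replacing files through HDFS operations, make Impala |
| -- aware of the new data files and block locations. |
| refresh census; |
| |
| -- DROP TABLE physically removes the directory and all the data files in it. |
| drop table census; |
| </codeblock> |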
| |
| <p conref="../shared/impala_common.xml#common/example_blurb"/> |
| |
| <p conref="../shared/impala_common.xml#common/switch_internal_external_table"/> |
| |
| <p conref="../shared/impala_common.xml#common/related_info"/> |
| |
| <p> |
| <xref href="impala_tables.xml#external_tables"/>, <xref href="impala_create_table.xml#create_table"/>, |
| <xref href="impala_drop_table.xml#drop_table"/>, <xref href="impala_alter_table.xml#alter_table"/>, |
| <xref href="impala_describe.xml#describe"/> |
| </p> |
| </conbody> |
| </concept> |
| |
| <concept id="external_tables"> |
| |
| <title>External Tables</title> |
| |
| <conbody> |
| |
| <p> |
| <indexterm audience="hidden">external tables</indexterm> |
| The syntax <codeph>CREATE EXTERNAL TABLE</codeph> sets up an Impala table that points at existing data |
| files, potentially in HDFS locations outside the normal Impala data directories. This operation saves the |
| expense of importing the data into a new table when you already have the data files in a known location in |
| HDFS, in the desired file format. |
| </p> |
| |
| <ul> |
| <li> |
| <p> |
| You can use Impala to query the data in this table. |
| </p> |
| </li> |
| |
| <li> |
| <p> |
| You can create data in external tables by issuing <codeph>INSERT</codeph> or <codeph>LOAD DATA</codeph> |
| statements. |
| </p> |
| </li> |
| |
| <li> |
| <p> |
| If you add or replace data using HDFS operations, issue the <codeph>REFRESH</codeph> command in |
| <cmdname>impala-shell</cmdname> so that Impala recognizes the changes in data files, block locations, |
| and so on. |
| </p> |
| </li> |
| |
| <li> |
| <p> |
| When you issue a <codeph>DROP TABLE</codeph> statement in Impala, Impala removes the connection to the |
| associated data files but does not physically remove the underlying data. You can |
| continue to use the data files with other Hadoop components and HDFS operations. |
| </p> |
| </li> |
| |
| <li> |
| <p conref="../shared/impala_common.xml#common/check_internal_external_table"/> |
| </li> |
| |
| <li> |
| <p> |
| When you issue an <codeph>ALTER TABLE</codeph> statement to rename an external table, all data files |
| are left in their original locations. |
| </p> |
| </li> |
| |
| <li> |
| <p> |
| You can point multiple external tables at the same HDFS directory by using the same |
| <codeph>LOCATION</codeph> attribute for each one. The tables could have different column definitions, |
| as long as the number and types of columns are compatible with the schema evolution considerations for |
| the underlying file type. For example, for text data files, one table might define a certain column as |
| a <codeph>STRING</codeph> while another defines the same column as a <codeph>BIGINT</codeph>. |
| </p> |
| </li> |
| </ul> |
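| <p> |
| The points above can be sketched as follows, assuming the data files already exist in the hypothetical HDFS |
| directory shown in the <codeph>LOCATION</codeph> clause. |
| </p> |
| |
| <codeblock> |
| -- Point an external table at comma-delimited text files that already exist in HDFS. |
| create external table logs (ts timestamp, msg string) |
| row format delimited fields terminated by ',' |
| location '/user/etl/log_data'; |
| |
| -- Dropping the table removes only the table definition; the files underneath |
| -- /user/etl/log_data remain available to other Hadoop components. |
| drop table logs; |
| </codeblock> |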
| |
| <p conref="../shared/impala_common.xml#common/example_blurb"/> |
| |
| <p conref="../shared/impala_common.xml#common/switch_internal_external_table"/> |
| |
| <p conref="../shared/impala_common.xml#common/related_info"/> |
| |
| <p> |
| <xref href="impala_tables.xml#internal_tables"/>, <xref href="impala_create_table.xml#create_table"/>, |
| <xref href="impala_drop_table.xml#drop_table"/>, <xref href="impala_alter_table.xml#alter_table"/>, |
| <xref href="impala_describe.xml#describe"/> |
| </p> |
| </conbody> |
| </concept> |
| |
| <concept id="table_file_formats"> |
| <title>File Formats</title> |
| |
| <conbody> |
| <p> |
| Each table has an associated file format, which determines how Impala interprets the |
| associated data files. See <xref href="impala_file_formats.xml#file_formats"/> for details. |
| </p> |
| <p> |
| You set the file format during the <codeph>CREATE TABLE</codeph> statement, |
| or change it later using the <codeph>ALTER TABLE</codeph> statement. |
| Partitioned tables can have a different file format for individual partitions, |
| allowing you to change the file format used in your ETL process for new data |
| without going back and reconverting all the existing data in the same table. |
| </p> |
| <p> |
| Any <codeph>INSERT</codeph> statements produce new data files with the current file format of the table. |
| For existing data files, changing the file format of the table does not automatically do any data conversion. |
| You must use <codeph>TRUNCATE TABLE</codeph> or <codeph>INSERT OVERWRITE</codeph> to remove any previous data |
| files that use the old file format. |
| Then you use the <codeph>LOAD DATA</codeph> statement, <codeph>INSERT ... SELECT</codeph>, or another mechanism |
| to put data files of the correct format into the table. |
| </p> |
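| <p> |
| The conversion workflow described above might look like the following sketch, assuming a hypothetical table |
| <codeph>t1</codeph> that currently holds text data files. |
| </p> |
| |
| <codeblock> |
| -- Subsequent INSERT statements write data files in Parquet format. |
| alter table t1 set fileformat parquet; |
| |
| -- Existing text files are not converted automatically; remove them first... |
| truncate table t1; |
| |
| -- ...then repopulate the table, producing files in the new format. |
| insert into t1 select * from text_staging_table; |
| </codeblock> |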
| <p> |
| The default file format, text, is the most flexible and the easiest to produce when you are just getting started with |
| Impala. The Parquet file format offers the highest query performance and uses compression to reduce storage |
| requirements; therefore, where practical, use Parquet for Impala tables with substantial amounts of data. |
| <ph rev="2.3.0">Also, the complex types (<codeph>ARRAY</codeph>, <codeph>STRUCT</codeph>, and <codeph>MAP</codeph>) |
| available in <keyword keyref="impala23_full"/> and higher are currently only supported with the Parquet file type.</ph> |
| Based on your existing ETL workflow, you might use other file formats such as Avro, possibly doing a final |
| conversion step to Parquet to take advantage of its performance for analytic queries. |
| </p> |
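| <p> |
| For example, a table intended for substantial analytic workloads might be populated from an existing text |
| table as a final conversion step to Parquet. (The table names here are hypothetical.) |
| </p> |
| |
| <codeblock> |
| -- Create a Parquet version of an existing text table in a single statement. |
| create table sales_parquet stored as parquet |
| as select * from sales_text; |
| </codeblock> |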
| </conbody> |
| </concept> |
| |
| <concept rev="kudu" id="kudu_tables"> |
| <title>Kudu Tables</title> |
| <prolog> |
| <metadata> |
| <data name="Category" value="Kudu"/> |
| </metadata> |
| </prolog> |
| |
| <conbody> |
| <p> |
| Tables stored in Apache Kudu are treated specially, because Kudu manages its data independently of HDFS files. |
| Some information about the table is stored in the metastore database for use by Impala. Other table metadata is |
| managed internally by Kudu. |
| </p> |
| |
| <p> |
| When you create a Kudu table through Impala, it is assigned an internal Kudu table name of the form |
| <codeph>impala::<varname>db_name</varname>.<varname>table_name</varname></codeph>. You can see the Kudu-assigned name |
| in the output of <codeph>DESCRIBE FORMATTED</codeph>, in the <codeph>kudu.table_name</codeph> field of the table properties. |
| The Kudu-assigned name remains the same even if you use <codeph>ALTER TABLE</codeph> to rename the Impala table |
| or move it to a different Impala database. If you issue the statement |
| <codeph>ALTER TABLE <varname>impala_name</varname> SET TBLPROPERTIES('kudu.table_name' = '<varname>different_kudu_table_name</varname>')</codeph>, |
| the effect is different depending on whether the Impala table was created with a regular <codeph>CREATE TABLE</codeph> |
| statement (that is, if it is an internal or managed table), or if it was created with a |
| <codeph>CREATE EXTERNAL TABLE</codeph> statement (and therefore is an external table). Changing the <codeph>kudu.table_name</codeph> |
| property of an internal table physically renames the underlying Kudu table to match the new name. |
| Changing the <codeph>kudu.table_name</codeph> property of an external table switches which underlying Kudu table |
| the Impala table refers to; the underlying Kudu table must already exist. |
| </p> |
| |
| <p> |
| The following example shows what happens with both internal and external Kudu tables as the <codeph>kudu.table_name</codeph> |
| property is changed. In practice, external tables are typically used to access underlying Kudu tables that were created |
| outside of Impala, that is, through the Kudu API. |
| </p> |
| |
| <codeblock> |
| -- This is an internal table that we will create and then rename. |
| create table old_name (id bigint primary key, s string) |
| partition by hash(id) partitions 2 stored as kudu; |
| |
| -- Initially, the name OLD_NAME is the same on the Impala and Kudu sides. |
| describe formatted old_name; |
| ... |
| | Location: | hdfs://host.example.com:8020/path/user.db/old_name |
| | Table Type: | MANAGED_TABLE | NULL |
| | Table Parameters: | NULL | NULL |
| | | DO_NOT_UPDATE_STATS | true |
| | | kudu.master_addresses | vd0342.example.com |
| | | kudu.table_name | impala::user.old_name |
| |
| -- ALTER TABLE RENAME TO changes the Impala name but not the underlying Kudu name. |
| alter table old_name rename to new_name; |
| |
| describe formatted new_name; |
| | Location: | hdfs://host.example.com:8020/path/user.db/new_name |
| | Table Type: | MANAGED_TABLE | NULL |
| | Table Parameters: | NULL | NULL |
| | | DO_NOT_UPDATE_STATS | true |
| | | kudu.master_addresses | vd0342.example.com |
| | | kudu.table_name | impala::user.old_name |
| |
| -- Setting TBLPROPERTIES changes the underlying Kudu name. |
| alter table new_name |
| set tblproperties('kudu.table_name' = 'impala::user.new_name'); |
| |
| describe formatted new_name; |
| | Location: | hdfs://host.example.com:8020/path/user.db/new_name |
| | Table Type: | MANAGED_TABLE | NULL |
| | Table Parameters: | NULL | NULL |
| | | DO_NOT_UPDATE_STATS | true |
| | | kudu.master_addresses | vd0342.example.com |
| | | kudu.table_name | impala::user.new_name |
| |
| -- Put some data in the table to demonstrate how external tables can map to |
| -- different underlying Kudu tables. |
| insert into new_name values (0, 'zero'), (1, 'one'), (2, 'two'); |
| |
| -- This external table points to the same underlying Kudu table, NEW_NAME, |
| -- as we created above. No need to declare columns or other table aspects. |
| create external table kudu_table_alias stored as kudu |
| tblproperties('kudu.table_name' = 'impala::user.new_name'); |
| |
| -- The external table can fetch data from the NEW_NAME table that already |
| -- existed and already had data. |
| select * from kudu_table_alias limit 100; |
| +----+------+ |
| | id | s | |
| +----+------+ |
| | 1 | one | |
| | 0 | zero | |
| | 2 | two | |
| +----+------+ |
| |
| -- We cannot re-point the external table at a different underlying Kudu table |
| -- unless that other underlying Kudu table already exists. |
| alter table kudu_table_alias |
| set tblproperties('kudu.table_name' = 'impala::user.yet_another_name'); |
| ERROR: |
| TableLoadingException: Error opening Kudu table 'impala::user.yet_another_name', |
| Kudu error: The table does not exist: table_name: "impala::user.yet_another_name" |
| |
| -- Once the underlying Kudu table exists, we can re-point the external table to it. |
| create table yet_another_name (id bigint primary key, x int, y int, s string) |
| partition by hash(id) partitions 2 stored as kudu; |
| |
| alter table kudu_table_alias |
| set tblproperties('kudu.table_name' = 'impala::user.yet_another_name'); |
| |
| -- Now no data is returned because this other table is empty. |
| select * from kudu_table_alias limit 100; |
| |
| -- The Impala table automatically recognizes the table schema of the new table, |
| -- for example the extra X and Y columns not present in the original table. |
| describe kudu_table_alias; |
| +------+--------+---------+-------------+----------+... |
| | name | type | comment | primary_key | nullable |... |
| +------+--------+---------+-------------+----------+... |
| | id | bigint | | true | false |... |
| | x | int | | false | true |... |
| | y | int | | false | true |... |
| | s | string | | false | true |... |
| +------+--------+---------+-------------+----------+... |
| </codeblock> |
| |
| <p> |
| The <codeph>SHOW TABLE STATS</codeph> output for a Kudu table shows Kudu-specific details about the layout of the table. |
| Instead of information about the number and sizes of files, the information is divided by the Kudu tablets. |
| For each tablet, the output includes the fields |
| <codeph># Rows</codeph> (although this number is not currently computed), <codeph>Start Key</codeph>, <codeph>Stop Key</codeph>, <codeph>Leader Replica</codeph>, and <codeph># Replicas</codeph>. |
| The output of <codeph>SHOW COLUMN STATS</codeph>, illustrating the distribution of values within each column, is the same for Kudu tables |
| as for HDFS-backed tables. |
| </p> |
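| <p> |
| For example, to examine the tablet layout and column statistics of the <codeph>NEW_NAME</codeph> table from |
| the earlier example: |
| </p> |
| |
| <codeblock> |
| -- One row per Kudu tablet, with Start Key, Stop Key, Leader Replica, and # Replicas. |
| show table stats new_name; |
| |
| -- Column-level statistics, in the same layout as for HDFS-backed tables. |
| show column stats new_name; |
| </codeblock> |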
| |
| <p conref="../shared/impala_common.xml#common/kudu_internal_external_tables"/> |
| </conbody> |
| </concept> |
| |
| </concept> |