<?xml version="1.0" encoding="UTF-8"?>
<!--
Licensed to the Apache Software Foundation (ASF) under one
or more contributor license agreements. See the NOTICE file
distributed with this work for additional information
regarding copyright ownership. The ASF licenses this file
to you under the Apache License, Version 2.0 (the
"License"); you may not use this file except in compliance
with the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing,
software distributed under the License is distributed on an
"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
KIND, either express or implied. See the License for the
specific language governing permissions and limitations
under the License.
-->
<!DOCTYPE concept PUBLIC "-//OASIS//DTD DITA Concept//EN" "concept.dtd">
<concept id="intro_hadoop">
<title>How Impala Fits Into the Hadoop Ecosystem</title>
<titlealts audience="PDF"><navtitle>Role in the Hadoop Ecosystem</navtitle></titlealts>
<prolog>
<metadata>
<data name="Category" value="Impala"/>
<data name="Category" value="Concepts"/>
<data name="Category" value="Hadoop"/>
<data name="Category" value="Administrators"/>
<data name="Category" value="Developers"/>
<data name="Category" value="Data Analysts"/>
</metadata>
</prolog>
<conbody>
<p>
Impala makes use of many familiar components within the Hadoop ecosystem. Impala can interchange data with
other Hadoop components, as both a consumer and a producer, so it fits into your ETL and ELT pipelines in
flexible ways.
</p>
<p outputclass="toc inpage"/>
</conbody>
<concept id="intro_hive">
<title>How Impala Works with Hive</title>
<conbody>
<p>
A major Impala goal is to make SQL-on-Hadoop operations fast and efficient enough to appeal to new
categories of users and open up Hadoop to new types of use cases. Where practical, Impala makes use of the
existing Apache Hive infrastructure that many Hadoop users already have in place for long-running,
batch-oriented SQL queries.
</p>
<p>
In particular, Impala keeps its table definitions in a traditional MySQL or PostgreSQL database known as
the <b>metastore</b>, the same database where Hive keeps this type of data. Thus, Impala can access tables
defined or loaded by Hive, as long as all columns use Impala-supported data types, file formats, and
compression codecs.
</p>
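<p>
For example, a table created and populated through Hive becomes queryable from Impala after a one-time
metadata refresh. The following is a minimal sketch; the table name and columns are hypothetical:
</p>
<codeblock>-- In the Hive shell: define and populate a table.
CREATE TABLE web_logs (ip STRING, url STRING, ts TIMESTAMP)
  STORED AS PARQUET;

-- In impala-shell: make the Hive-created table visible, then query it.
INVALIDATE METADATA web_logs;
SELECT COUNT(*) FROM web_logs;</codeblock>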
<p>
The initial focus on query features and performance means that Impala can read more types of data with the
<codeph>SELECT</codeph> statement than it can write with the <codeph>INSERT</codeph> statement. To query
data using the Avro, RCFile, or SequenceFile <xref href="impala_file_formats.xml#file_formats">file
formats</xref>, you load the data using Hive.
</p>
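<p>
For instance, a hypothetical workflow prepares an Avro table in Hive and queries it through Impala. The
names are illustrative, and the syntax assumes a Hive version with native <codeph>STORED AS AVRO</codeph>
support:
</p>
<codeblock>-- In the Hive shell: create an Avro table and load it with data.
CREATE TABLE events_avro (event_id BIGINT, payload STRING)
  STORED AS AVRO;
INSERT INTO TABLE events_avro SELECT event_id, payload FROM staging_events;

-- In impala-shell: pick up the new table, then read it with SELECT.
INVALIDATE METADATA events_avro;
SELECT COUNT(*) FROM events_avro;</codeblock>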
<p rev="1.2.2">
The Impala query optimizer can also make use of <xref href="impala_perf_stats.xml#perf_table_stats">table
statistics</xref> and <xref href="impala_perf_stats.xml#perf_column_stats">column statistics</xref>.
Originally, you gathered this information with the <codeph>ANALYZE TABLE</codeph> statement in Hive; in
Impala 1.2.2 and higher, use the Impala <codeph><xref href="impala_compute_stats.xml#compute_stats">COMPUTE
STATS</xref></codeph> statement instead. <codeph>COMPUTE STATS</codeph> requires less setup, is more
reliable, and does not require switching back and forth between <cmdname>impala-shell</cmdname>
and the Hive shell.
</p>
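<p>
A minimal sketch of the statistics workflow in <cmdname>impala-shell</cmdname>, using a hypothetical
table name:
</p>
<codeblock>-- Gather table and column statistics in a single step.
COMPUTE STATS web_logs;

-- Verify the statistics that the query optimizer will use.
SHOW TABLE STATS web_logs;
SHOW COLUMN STATS web_logs;</codeblock>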
</conbody>
</concept>
<concept id="intro_metastore">
<title>Overview of Impala Metadata and the Metastore</title>
<prolog>
<metadata>
<data name="Category" value="Concepts"/>
<data name="Category" value="Impala"/>
<data name="Category" value="Hive"/>
</metadata>
</prolog>
<conbody>
<p>
As discussed in <xref href="impala_hadoop.xml#intro_hive"/>, Impala maintains information about table
definitions in a central database known as the <b>metastore</b>. Impala also tracks other metadata for the
low-level characteristics of data files, such as the physical locations of blocks within HDFS.
</p>
<p>
For tables with a large volume of data and/or many partitions, retrieving all the metadata for a table can
be time-consuming, taking minutes in some cases. Thus, each Impala node caches all of this metadata to
reuse for future queries against the same table.
</p>
<p rev="1.2">
If the table definition or the data in the table is updated, all other Impala daemons in the cluster must
receive the latest metadata, replacing the obsolete cached metadata, before issuing a query against that
table. In Impala 1.2 and higher, the metadata update is automatic, coordinated through the
<cmdname>catalogd</cmdname> daemon, for all DDL and DML statements issued through Impala. See
<xref href="impala_components.xml#intro_catalogd"/> for details.
</p>
<p>
For DDL and DML issued through Hive, or changes made manually to files in HDFS, you still use the
<codeph>REFRESH</codeph> statement (when new data files are added to existing tables) or the
<codeph>INVALIDATE METADATA</codeph> statement (for entirely new tables, or after dropping a table,
performing an HDFS rebalance operation, or deleting data files). Issuing <codeph>INVALIDATE
METADATA</codeph> by itself retrieves metadata for all the tables tracked by the metastore. If you know
that only specific tables have changed outside of Impala, you can issue <codeph>REFRESH
<varname>table_name</varname></codeph> for each affected table to retrieve the latest metadata for just
those tables.
</p>
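<p>
A minimal sketch of both statements, with hypothetical table names:
</p>
<codeblock>-- After Hive or manual HDFS operations add files to an existing table:
REFRESH sales_data;

-- After tables are created or dropped outside Impala, or after an HDFS
-- rebalance, reload metadata for all tables...
INVALIDATE METADATA;

-- ...or, more cheaply, for a single affected table:
INVALIDATE METADATA new_sales_table;</codeblock>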
</conbody>
</concept>
<concept id="intro_hdfs">
<title>How Impala Uses HDFS</title>
<prolog>
<metadata>
<data name="Category" value="Concepts"/>
<data name="Category" value="Impala"/>
<data name="Category" value="HDFS"/>
</metadata>
</prolog>
<conbody>
<p>
Impala uses the distributed filesystem HDFS as its primary data storage medium. Impala relies on the
redundancy provided by HDFS to guard against hardware or network outages on individual nodes. Impala table
data is physically represented as data files in HDFS, using familiar HDFS file formats and compression
codecs. When data files are present in the directory for a new table, Impala reads them all, regardless of
file name. New data is added in files with names controlled by Impala.
</p>
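<p>
For example, if data files are copied directly into a table's HDFS directory, Impala picks them up after
a <codeph>REFRESH</codeph>. The table name and paths below are hypothetical:
</p>
<codeblock>-- Shell step (outside impala-shell):
--   hdfs dfs -put more_data.csv /user/hive/warehouse/logs_csv/

-- In impala-shell: rescan the table's directory, then query.
REFRESH logs_csv;
SELECT COUNT(*) FROM logs_csv;  -- Reads every data file in the directory.</codeblock>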
</conbody>
</concept>
<concept id="intro_hbase">
<title>How Impala Uses HBase</title>
<prolog>
<metadata>
<data name="Category" value="Concepts"/>
<data name="Category" value="Impala"/>
<data name="Category" value="HBase"/>
</metadata>
</prolog>
<conbody>
<p>
HBase is an alternative to HDFS as a storage medium for Impala data. It is a database storage system built
on top of HDFS, without built-in SQL support. Many Hadoop users already have it configured and store large
(often sparse) data sets in it. By defining tables in Impala and mapping them to equivalent tables in
HBase, you can query the contents of the HBase tables through Impala, and even run join queries that
combine HBase tables with HDFS-backed Impala tables. See <xref href="impala_hbase.xml#impala_hbase"/> for
details.
</p>
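<p>
A minimal sketch of the mapping, assuming an existing HBase table named <codeph>users</codeph> and a Hive
installation with the HBase storage handler; all names are hypothetical:
</p>
<codeblock>-- In the Hive shell: map the HBase table into the metastore.
CREATE EXTERNAL TABLE hbase_users (user_id STRING, name STRING, email STRING)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES ("hbase.columns.mapping" = ":key,info:name,info:email")
TBLPROPERTIES ("hbase.table.name" = "users");

-- In impala-shell: pick up the mapping, then join with an HDFS-backed table.
INVALIDATE METADATA hbase_users;
SELECT u.name, COUNT(*) AS visits
FROM hbase_users u JOIN visits v ON u.user_id = v.user_id
GROUP BY u.name;</codeblock>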
</conbody>
</concept>
</concept>