<?xml version="1.0" encoding="UTF-8"?>
<!--
Licensed to the Apache Software Foundation (ASF) under one
or more contributor license agreements. See the NOTICE file
distributed with this work for additional information
regarding copyright ownership. The ASF licenses this file
to you under the Apache License, Version 2.0 (the
"License"); you may not use this file except in compliance
with the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing,
software distributed under the License is distributed on an
"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
KIND, either express or implied. See the License for the
specific language governing permissions and limitations
under the License.
-->
<!DOCTYPE concept PUBLIC "-//OASIS//DTD DITA Concept//EN" "concept.dtd">
<concept id="intro_hadoop">
<title>How Impala Fits Into the Hadoop Ecosystem</title>
<titlealts audience="PDF"><navtitle>Role in the Hadoop Ecosystem</navtitle></titlealts>
<prolog>
<metadata>
<data name="Category" value="Impala"/>
<data name="Category" value="Concepts"/>
<data name="Category" value="Hadoop"/>
<data name="Category" value="Administrators"/>
<data name="Category" value="Developers"/>
<data name="Category" value="Data Analysts"/>
</metadata>
</prolog>
<conbody>
<p>
Impala makes use of many familiar components within the Hadoop ecosystem. Impala can interchange data with
other Hadoop components, as both a consumer and a producer, so it fits into your ETL and ELT pipelines in
flexible ways.
</p>
<p outputclass="toc inpage"/>
</conbody>
<concept id="intro_hive">
<title>How Impala Works with Hive</title>
<conbody>
<p>
A major Impala goal is to make SQL-on-Hadoop operations fast and efficient enough to appeal to new
categories of users and open up Hadoop to new types of use cases. Where practical, Impala makes use of the
existing Apache Hive infrastructure that many Hadoop users already have in place for long-running,
batch-oriented SQL queries.
</p>
<p>
In particular, Impala keeps its table definitions in a traditional MySQL or PostgreSQL database known as
the <b>metastore</b>, the same database where Hive keeps this type of data. Thus, Impala can access tables
defined or loaded by Hive, as long as all columns use Impala-supported data types, file formats, and
compression codecs.
</p>
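<p>
For example, a table created and populated through Hive becomes queryable from Impala after a one-time
metadata refresh. The following is a minimal sketch; the table name and columns are hypothetical:
</p>
<codeblock>-- In the Hive shell: define and populate a table.
CREATE TABLE web_logs (ip STRING, url STRING, ts TIMESTAMP)
  STORED AS PARQUET;

-- In impala-shell: make the Hive-created table visible, then query it.
INVALIDATE METADATA web_logs;
SELECT COUNT(*) FROM web_logs;</codeblock>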
<p>
The initial focus on query features and performance means that Impala can read more types of data with the
<codeph>SELECT</codeph> statement than it can write with the <codeph>INSERT</codeph> statement. To query
data using the Avro, RCFile, or SequenceFile <xref href="impala_file_formats.xml#file_formats">file
formats</xref>, you load the data using Hive.
</p>
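<p>
For instance, a hypothetical workflow prepares an Avro table in Hive and queries it through Impala. The
names are illustrative, and the syntax assumes a Hive version with native <codeph>STORED AS AVRO</codeph>
support:
</p>
<codeblock>-- In the Hive shell: create an Avro table and load it with data.
CREATE TABLE events_avro (event_id BIGINT, payload STRING)
  STORED AS AVRO;
INSERT INTO TABLE events_avro SELECT event_id, payload FROM staging_events;

-- In impala-shell: pick up the new table, then read it with SELECT.
INVALIDATE METADATA events_avro;
SELECT COUNT(*) FROM events_avro;</codeblock>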
<p rev="1.2.2">
The Impala query optimizer can also make use of <xref href="impala_perf_stats.xml#perf_table_stats">table
statistics</xref> and <xref href="impala_perf_stats.xml#perf_column_stats">column statistics</xref>.
Originally, you gathered this information with the <codeph>ANALYZE TABLE</codeph> statement in Hive; in
Impala 1.2.2 and higher, use the Impala <codeph><xref href="impala_compute_stats.xml#compute_stats">COMPUTE
STATS</xref></codeph> statement instead. <codeph>COMPUTE STATS</codeph> requires less setup, is more
reliable, and does not require switching back and forth between <cmdname>impala-shell</cmdname>
and the Hive shell.
</p>
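<p>
A minimal sketch of the statistics workflow in <cmdname>impala-shell</cmdname>, using a hypothetical
table name:
</p>
<codeblock>-- Gather table and column statistics in a single step.
COMPUTE STATS web_logs;

-- Verify the statistics that the query optimizer will use.
SHOW TABLE STATS web_logs;
SHOW COLUMN STATS web_logs;</codeblock>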
</conbody>
</concept>
<concept id="intro_metastore">
<title>Overview of Impala Metadata and the Metastore</title>
<prolog>
<metadata>
<data name="Category" value="Concepts"/>
<data name="Category" value="Impala"/>
<data name="Category" value="Hive"/>
</metadata>
</prolog>
<conbody>
<p>
As discussed in <xref href="impala_hadoop.xml#intro_hive"/>, Impala maintains information about table
definitions in a central database known as the <b>metastore</b>. Impala also tracks other metadata for the
low-level characteristics of data files, such as the physical locations of blocks within HDFS.
</p>
<p>
For tables with a large volume of data and/or many partitions, retrieving all the metadata for a table can
be time-consuming, taking minutes in some cases. Thus, each Impala node caches all of this metadata to
reuse for future queries against the same table.
</p>
<p rev="1.2">
If the table definition or the data in the table is updated, all other Impala daemons in the cluster must
receive the latest metadata, replacing the obsolete cached metadata, before issuing a query against that
table. In Impala 1.2 and higher, the metadata update is automatic, coordinated through the
<cmdname>catalogd</cmdname> daemon, for all DDL and DML statements issued through Impala. See
<xref href="impala_components.xml#intro_catalogd"/> for details.
</p>
<p>
For DDL and DML issued through Hive, or changes made manually to files in HDFS, you still use the
<codeph>REFRESH</codeph> statement (when new data files are added to existing tables) or the
<codeph>INVALIDATE METADATA</codeph> statement (for entirely new tables, or after dropping a table,
performing an HDFS rebalance operation, or deleting data files). Issuing <codeph>INVALIDATE
METADATA</codeph> by itself retrieves metadata for all the tables tracked by the metastore. If you know
that only specific tables have changed outside of Impala, you can issue <codeph>REFRESH
<varname>table_name</varname></codeph> for each affected table to retrieve the latest metadata for just
those tables.
</p>
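<p>
A minimal sketch of both statements, with hypothetical table names:
</p>
<codeblock>-- After Hive or manual HDFS operations add files to an existing table:
REFRESH sales_data;

-- After tables are created or dropped outside Impala, or after an HDFS
-- rebalance, reload metadata for all tables...
INVALIDATE METADATA;

-- ...or, more cheaply, for a single affected table:
INVALIDATE METADATA new_sales_table;</codeblock>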
</conbody>
</concept>
<concept id="intro_hdfs">
<title>How Impala Uses HDFS</title>
<prolog>
<metadata>
<data name="Category" value="Concepts"/>
<data name="Category" value="Impala"/>
<data name="Category" value="HDFS"/>
</metadata>
</prolog>
<conbody>
<p>
Impala uses the distributed filesystem HDFS as its primary data storage medium. Impala relies on the
redundancy provided by HDFS to guard against hardware or network outages on individual nodes. Impala table
data is physically represented as data files in HDFS, using familiar HDFS file formats and compression
codecs. When data files are present in the directory for a new table, Impala reads them all, regardless of
file name. New data is added in files with names controlled by Impala.
</p>
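<p>
For example, if data files are copied directly into a table's HDFS directory, Impala picks them up after
a <codeph>REFRESH</codeph>. The table name and paths below are hypothetical:
</p>
<codeblock>-- Shell step (outside impala-shell):
--   hdfs dfs -put more_data.csv /user/hive/warehouse/logs_csv/

-- In impala-shell: rescan the table's directory, then query.
REFRESH logs_csv;
SELECT COUNT(*) FROM logs_csv;  -- Reads every data file in the directory.</codeblock>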
</conbody>
</concept>
<concept id="intro_hbase">
<title>How Impala Uses HBase</title>
<prolog>
<metadata>
<data name="Category" value="Concepts"/>
<data name="Category" value="Impala"/>
<data name="Category" value="HBase"/>
</metadata>
</prolog>
<conbody>
<p>
HBase is an alternative to HDFS as a storage medium for Impala data. It is a database storage system built
on top of HDFS, without built-in SQL support. Many Hadoop users already have it configured and store large
(often sparse) data sets in it. By defining tables in Impala and mapping them to equivalent tables in
HBase, you can query the contents of the HBase tables through Impala, and even run join queries that
combine HBase tables with HDFS-backed Impala tables. See <xref href="impala_hbase.xml#impala_hbase"/> for
details.
</p>
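<p>
A minimal sketch of the mapping, assuming an existing HBase table named <codeph>users</codeph> and a Hive
installation with the HBase storage handler; all names are hypothetical:
</p>
<codeblock>-- In the Hive shell: map the HBase table into the metastore.
CREATE EXTERNAL TABLE hbase_users (user_id STRING, name STRING, email STRING)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES ("hbase.columns.mapping" = ":key,info:name,info:email")
TBLPROPERTIES ("hbase.table.name" = "users");

-- In impala-shell: pick up the mapping, then join with an HDFS-backed table.
INVALIDATE METADATA hbase_users;
SELECT u.name, COUNT(*) AS visits
FROM hbase_users u JOIN visits v ON u.user_id = v.user_id
GROUP BY u.name;</codeblock>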
</conbody>
</concept>
</concept>