<?xml version="1.0" encoding="UTF-8"?>
<!--
Licensed to the Apache Software Foundation (ASF) under one
or more contributor license agreements. See the NOTICE file
distributed with this work for additional information
regarding copyright ownership. The ASF licenses this file
to you under the Apache License, Version 2.0 (the
"License"); you may not use this file except in compliance
with the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing,
software distributed under the License is distributed on an
"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
KIND, either express or implied. See the License for the
specific language governing permissions and limitations
under the License.
-->
<!DOCTYPE concept PUBLIC "-//OASIS//DTD DITA Concept//EN" "concept.dtd">
<concept id="intro_components">
<title>Components of the Impala Server</title>
<titlealts audience="PDF"><navtitle>Components</navtitle></titlealts>
<prolog>
<metadata>
<data name="Category" value="Impala"/>
<data name="Category" value="Concepts"/>
<data name="Category" value="Administrators"/>
<data name="Category" value="Developers"/>
<data name="Category" value="Data Analysts"/>
</metadata>
</prolog>
<conbody>
<p>
The Impala server is a distributed, massively parallel processing (MPP) database engine. It consists of
different daemon processes that run on specific hosts within your <keyword keyref="distro"/> cluster.
</p>
<p outputclass="toc inpage"/>
</conbody>
<concept id="intro_impalad">
<title>The Impala Daemon</title>
<conbody>
<p>
The core Impala component is a daemon process that runs on each DataNode of the cluster, physically represented
by the <codeph>impalad</codeph> process. It reads and writes to data files; accepts queries submitted
through the <codeph>impala-shell</codeph> command, Hue, JDBC, or ODBC; parallelizes the queries and
distributes work across the cluster; and transmits intermediate query results back to the
central coordinator node.
</p>
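      <p>
        As a minimal illustration (the host name <codeph>impala-host-1</codeph> and table name
        <codeph>sales</codeph> below are hypothetical), you can submit a query to a specific daemon
        directly from the command line:
      </p>
<codeblock>$ impala-shell -i impala-host-1:21000 -q 'SELECT COUNT(*) FROM sales;'</codeblock>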
<p>
You can submit a query to the Impala daemon running on any DataNode, and that instance of the daemon serves as the
<term>coordinator node</term> for that query. The other nodes transmit partial results back to the
coordinator, which constructs the final result set for a query. When experimenting with Impala functionality
through the <codeph>impala-shell</codeph> command, you might always connect to the same Impala daemon for
convenience. For clusters running production workloads, you might load-balance by
submitting each query to a different Impala daemon in round-robin style, using the JDBC or ODBC interfaces.
</p>
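      <p>
        For example, the following commands (with hypothetical host names) open sessions against two different
        daemons; each daemon coordinates the queries submitted through its own session:
      </p>
<codeblock>$ impala-shell -i impala-host-1:21000    # impala-host-1 coordinates this session's queries
$ impala-shell -i impala-host-2:21000    # impala-host-2 coordinates this session's queries</codeblock>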
<p>
The Impala daemons are in constant communication with the <term>statestore</term> to confirm which nodes
are healthy and can accept new work.
</p>
<p rev="1.2">
They also receive broadcast messages from the <cmdname>catalogd</cmdname> daemon (introduced in Impala 1.2)
whenever any Impala node in the cluster creates, alters, or drops any type of object, or when an
<codeph>INSERT</codeph> or <codeph>LOAD DATA</codeph> statement is processed through Impala. This
background communication minimizes the need for <codeph>REFRESH</codeph> or <codeph>INVALIDATE
METADATA</codeph> statements that were needed to coordinate metadata across nodes prior to Impala 1.2.
</p>
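      <p>
        As a sketch of the effect (host and table names are hypothetical), a table created through one Impala
        daemon becomes visible through another without any manual metadata statement:
      </p>
<codeblock>[impala-host-1:21000] > CREATE TABLE t1 (x INT);

-- From a session on a different daemon; the catalogd broadcast means no
-- REFRESH or INVALIDATE METADATA is needed first.
[impala-host-2:21000] > SELECT COUNT(*) FROM t1;</codeblock>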
<p rev="2.9.0 IMPALA-3807 IMPALA-5147 IMPALA-5503">
In <keyword keyref="impala29_full"/> and higher, you can control which hosts act as query coordinators
and which act as query executors, to improve scalability for highly concurrent workloads on large clusters.
See <xref keyref="scalability_coordinator"/> for details.
</p>
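      <p>
        For example, a common arrangement uses the <codeph>is_coordinator</codeph> and
        <codeph>is_executor</codeph> startup flags to dedicate a few hosts to coordinating queries and the
        rest to executing query fragments (how you pass startup flags depends on your cluster manager):
      </p>
<codeblock># On the hosts that act only as coordinators:
impalad --is_coordinator=true --is_executor=false ...

# On the hosts that act only as executors:
impalad --is_coordinator=false --is_executor=true ...</codeblock>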
<p>
<b>Related information:</b> <xref href="impala_config_options.xml#config_options"/>,
<xref href="impala_processes.xml#processes"/>, <xref href="impala_timeouts.xml#impalad_timeout"/>,
<xref href="impala_ports.xml#ports"/>, <xref href="impala_proxy.xml#proxy"/>
</p>
</conbody>
</concept>
<concept id="intro_statestore">
<title>The Impala Statestore</title>
<conbody>
<p>
The Impala component known as the <term>statestore</term> checks on the health of Impala daemons on all the
DataNodes in a cluster, and continuously relays its findings to each of those daemons. It is physically
represented by a daemon process named <codeph>statestored</codeph>; you only need such a process on one
host in the cluster. If an Impala daemon goes offline due to hardware failure, network error, software issue,
or other reason, the statestore informs all the other Impala daemons so that future queries can avoid making
requests to the unreachable node.
</p>
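      <p>
        For a quick check that the single <codeph>statestored</codeph> process is up (assuming default ports;
        adjust for your deployment), you can list the process and probe its built-in web interface:
      </p>
<codeblock>$ ps -ef | grep statestored                    # one statestored process on this host
$ curl http://statestore-host:25010/metrics    # statestore web UI, default port 25010</codeblock>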
<p>
Because the statestore's purpose is to help when things go wrong, it is not critical to the normal
operation of an Impala cluster. If the statestore is not running or becomes unreachable, the Impala daemons
continue running and distributing work among themselves as usual; the cluster just becomes less robust if
other Impala daemons fail while the statestore is offline. When the statestore comes back online, it re-establishes
communication with the Impala daemons and resumes its monitoring function.
</p>
<p conref="../shared/impala_common.xml#common/statestored_catalogd_ha_blurb"/>
<p>
<b>Related information:</b>
</p>
<p>
<xref href="impala_scalability.xml#statestore_scalability"/>,
<xref href="impala_config_options.xml#config_options"/>, <xref href="impala_processes.xml#processes"/>,
<xref href="impala_timeouts.xml#statestore_timeout"/>, <xref href="impala_ports.xml#ports"/>
</p>
</conbody>
</concept>
<concept rev="1.2" id="intro_catalogd">
<title>The Impala Catalog Service</title>
<conbody>
<p>
The Impala component known as the <term>catalog service</term> relays the metadata changes from Impala SQL
statements to all the Impala daemons in a cluster. It is physically represented by a daemon process named
<codeph>catalogd</codeph>; you only need such a process on one host in the cluster. Because the requests
are passed through the statestore daemon, it makes sense to run the <cmdname>statestored</cmdname> and
<cmdname>catalogd</cmdname> services on the same host.
</p>
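      <p>
        On such a host, a plain process listing (nothing Impala-specific) confirms that both daemons are
        running side by side:
      </p>
<codeblock>$ ps -ef | grep -E 'statestored|catalogd'</codeblock>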
<p>
The catalog service avoids the need to issue
<codeph>REFRESH</codeph> and <codeph>INVALIDATE METADATA</codeph> statements when the metadata changes are
performed by statements issued through Impala. When you create a table, load data, and so on through Hive,
you do need to issue <codeph>REFRESH</codeph> or <codeph>INVALIDATE METADATA</codeph> on an Impala node
before executing a query there.
</p>
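      <p>
        For example (table names hypothetical), after creating a table or adding data files through Hive,
        run <codeph>INVALIDATE METADATA</codeph> for a table newly created outside Impala, or
        <codeph>REFRESH</codeph> for new data files added to a table Impala already knows about,
        before querying:
      </p>
<codeblock>-- Run on any single Impala node; the catalog service propagates the change.
[impala-host-1:21000] > INVALIDATE METADATA new_hive_table;
[impala-host-1:21000] > REFRESH existing_table;
[impala-host-1:21000] > SELECT COUNT(*) FROM new_hive_table;</codeblock>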
<p>
This feature touches a number of aspects of Impala:
</p>
<!-- This was formerly a conref, but since the list of links also included a link
to this same topic, materializing the list here and removing that
circular link. (The conref is still used in Incompatible Changes.)
<ul conref="../shared/impala_common.xml#common/catalogd_xrefs">
<li/>
</ul>
-->
<ul id="catalogd_xrefs">
<li>
<p>
See <xref href="impala_install.xml#install"/>, <xref href="impala_upgrading.xml#upgrading"/>, and
<xref href="impala_processes.xml#processes"/> for usage information about the
<cmdname>catalogd</cmdname> daemon.
</p>
</li>
<li>
<p>
The <codeph>REFRESH</codeph> and <codeph>INVALIDATE METADATA</codeph> statements are not needed
when the <codeph>CREATE TABLE</codeph>, <codeph>INSERT</codeph>, or other table-changing or
data-changing operation is performed through Impala. These statements are still needed if such
operations are done through Hive or by manipulating data files directly in HDFS, but in those cases the
statements only need to be issued on one Impala node rather than on all nodes. See
<xref href="impala_refresh.xml#refresh"/> and
<xref href="impala_invalidate_metadata.xml#invalidate_metadata"/> for the latest usage information for
those statements.
</p>
</li>
</ul>
<p conref="../shared/impala_common.xml#common/load_catalog_in_background"/>
<p conref="../shared/impala_common.xml#common/statestored_catalogd_ha_blurb"/>
<note>
<p conref="../shared/impala_common.xml#common/catalog_server_124"/>
</note>
<p>
<b>Related information:</b> <xref href="impala_config_options.xml#config_options"/>,
<xref href="impala_processes.xml#processes"/>, <xref href="impala_ports.xml#ports"/>
</p>
</conbody>
</concept>
</concept>