<?xml version="1.0" encoding="UTF-8"?>
<!--
Licensed to the Apache Software Foundation (ASF) under one
or more contributor license agreements. See the NOTICE file
distributed with this work for additional information
regarding copyright ownership. The ASF licenses this file
to you under the Apache License, Version 2.0 (the
"License"); you may not use this file except in compliance
with the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing,
software distributed under the License is distributed on an
"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
KIND, either express or implied. See the License for the
specific language governing permissions and limitations
under the License.
-->
<!DOCTYPE concept PUBLIC "-//OASIS//DTD DITA Concept//EN" "concept.dtd">
<concept id="intro_components">
<title>Components of the Impala Server</title>
<titlealts audience="PDF"><navtitle>Components</navtitle></titlealts>
<prolog>
<metadata>
<data name="Category" value="Impala"/>
<data name="Category" value="Concepts"/>
<data name="Category" value="Administrators"/>
<data name="Category" value="Developers"/>
<data name="Category" value="Data Analysts"/>
</metadata>
</prolog>
<conbody>
<p> The Impala server is a distributed, massively parallel processing (MPP)
database engine. It consists of different daemon processes that run on
specific hosts within your cluster. </p>
<p outputclass="toc inpage"/>
</conbody>
<concept id="intro_impalad">
<title>The Impala Daemon</title>
<conbody>
      <p> The core Impala component is the Impala daemon, physically represented
        by the <codeph>impalad</codeph> process. A few of the key functions that
        an Impala daemon performs are listed below, followed by a short
        example:<ul>
<li>Reads and writes to data files.</li>
<li>Accepts queries transmitted from the <codeph>impala-shell</codeph>
command, Hue, JDBC, or ODBC.</li>
<li>Parallelizes the queries and distributes work across the
cluster.</li>
<li>Transmits intermediate query results back to the central
coordinator. </li>
</ul></p>
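      <p> For example, a query submitted through <codeph>impala-shell</codeph>
        is planned and coordinated by the daemon that the shell connects to.
        (The hostname and table name below are hypothetical.) </p>
      <codeblock>$ impala-shell -i impala-host-1.example.com
[impala-host-1.example.com:21000] default> select count(*) from web_logs;</codeblock>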
<p>Impala daemons can be deployed in one of the following ways:<ul>
<li>HDFS and Impala are co-located, and each Impala daemon runs on the
same host as a DataNode.</li>
<li>Impala is deployed separately in a compute cluster and reads
remotely from HDFS, S3, ADLS, etc.</li>
</ul></p>
      <p> The Impala daemons are in constant communication with the StateStore,
        to confirm which daemons are healthy and can accept new work. </p>
      <p rev="1.2"> They also receive broadcast messages from the
        <cmdname>catalogd</cmdname> daemon (introduced in Impala 1.2) whenever
        any Impala daemon in the cluster creates, alters, or drops any type of
        object, or when an <codeph>INSERT</codeph> or <codeph>LOAD DATA</codeph>
        statement is processed through Impala. This background communication
        minimizes the need for the <codeph>REFRESH</codeph> and <codeph>INVALIDATE
        METADATA</codeph> statements that were required to coordinate metadata
        across Impala daemons prior to Impala 1.2. </p>
<p rev="2.9.0 IMPALA-3807 IMPALA-5147 IMPALA-5503">
In <keyword keyref="impala29_full"/> and higher, you can control which hosts act as query coordinators
and which act as query executors, to improve scalability for highly concurrent workloads on large clusters.
See <xref keyref="scalability_coordinator"/> for details.
</p>
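      <p> As a brief sketch, the role of each daemon is set with the
        <codeph>is_coordinator</codeph> and <codeph>is_executor</codeph>
        startup flags. (The division of hosts shown below is illustrative
        only.) </p>
      <codeblock># On a small number of hosts dedicated to query coordination:
impalad --is_coordinator=true --is_executor=false

# On the remaining hosts, dedicated to query execution:
impalad --is_coordinator=false --is_executor=true</codeblock>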
<p>
<b>Related information:</b> <xref href="impala_config_options.xml#config_options"/>,
<xref href="impala_processes.xml#processes"/>, <xref href="impala_timeouts.xml#impalad_timeout"/>,
<xref href="impala_ports.xml#ports"/>, <xref href="impala_proxy.xml#proxy"/>
</p>
</conbody>
</concept>
<concept id="intro_statestore">
<title>The Impala Statestore</title>
<conbody>
<p> The Impala component known as the StateStore checks on the health of
all Impala daemons in a cluster, and continuously relays its findings to
each of those daemons. It is physically represented by a daemon process
named <codeph>statestored</codeph>. You only need such a process on one
host in a cluster. If an Impala daemon goes offline due to hardware
failure, network error, software issue, or other reason, the StateStore
informs all the other Impala daemons so that future queries can avoid
making requests to the unreachable Impala daemon. </p>
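      <p> One quick way to see which daemons the StateStore currently considers
        registered is its debug web UI. (The hostname below is hypothetical;
        25010 is the usual default port for the <codeph>statestored</codeph>
        web UI, and the available debug pages can vary by release.) </p>
      <codeblock># List the subscribers (Impala daemons) registered with the StateStore:
curl http://statestore-host.example.com:25010/subscribers</codeblock>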
<p> Because the StateStore's purpose is to help when things go wrong and
to broadcast metadata to coordinators, it is not always critical to the
normal operation of an Impala cluster. If the StateStore is not running
or becomes unreachable, the Impala daemons continue running and
distributing work among themselves as usual when working with the data
known to Impala. The cluster just becomes less robust if other Impala
daemons fail, and metadata becomes less consistent as it changes while
the StateStore is offline. When the StateStore comes back online, it
re-establishes communication with the Impala daemons and resumes its
monitoring and broadcasting functions. </p>
      <p> If you issue a DDL statement while the StateStore is down, queries
        that access the new object created by that statement will fail. </p>
<p conref="../shared/impala_common.xml#common/statestored_catalogd_ha_blurb"/>
<p>
<b>Related information:</b>
</p>
<p>
<xref href="impala_scalability.xml#statestore_scalability"/>,
<xref href="impala_config_options.xml#config_options"/>, <xref href="impala_processes.xml#processes"/>,
<xref href="impala_timeouts.xml#statestore_timeout"/>, <xref href="impala_ports.xml#ports"/>
</p>
</conbody>
</concept>
<concept rev="1.2" id="intro_catalogd">
<title>The Impala Catalog Service</title>
<conbody>
<p> The Impala component known as the Catalog Service relays the metadata
changes from Impala SQL statements to all the Impala daemons in a
cluster. It is physically represented by a daemon process named
<codeph>catalogd</codeph>. You only need such a process on one host in
a cluster. Because the requests are passed through the StateStore
daemon, it makes sense to run the <cmdname>statestored</cmdname> and
<cmdname>catalogd</cmdname> services on the same host. </p>
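      <p> A minimal sketch of that co-location, with illustrative options only,
        starts both metadata-related daemons on the same host: </p>
      <codeblock># Typically run together on one host:
statestored -log_dir=/var/log/impala &amp;
catalogd -log_dir=/var/log/impala &amp;</codeblock>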
<p> The catalog service avoids the need to issue <codeph>REFRESH</codeph>
and <codeph>INVALIDATE METADATA</codeph> statements when the metadata
changes are performed by statements issued through Impala. When you
create a table, load data, and so on through Hive, you do need to issue
<codeph>REFRESH</codeph> or <codeph>INVALIDATE METADATA</codeph> on an
Impala daemon before executing a query there. </p>
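      <p> For example, after changing a table outside Impala, you run one of
        the following on an Impala daemon before querying the table there.
        (The database and table names are hypothetical.) </p>
      <codeblock># After a table is created outside Impala, for example through Hive:
impala-shell -q "INVALIDATE METADATA new_db.new_table"

# After data files are added to an existing table outside Impala:
impala-shell -q "REFRESH sales_db.web_logs"</codeblock>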
<p>
This feature touches a number of aspects of Impala:
</p>
<ul id="catalogd_xrefs">
<li>
<p>
            See <xref href="impala_install.xml#install"/>, <xref href="impala_upgrading.xml#upgrading"/>,
            and <xref href="impala_processes.xml#processes"/> for usage information about the
            <cmdname>catalogd</cmdname> daemon.
</p>
</li>
<li>
<p> The <codeph>REFRESH</codeph> and <codeph>INVALIDATE
METADATA</codeph> statements are not needed when the
<codeph>CREATE TABLE</codeph>, <codeph>INSERT</codeph>, or other
table-changing or data-changing operation is performed through
Impala. These statements are still needed if such operations are
done through Hive or by manipulating data files directly in HDFS,
but in those cases the statements only need to be issued on one
Impala daemon rather than on all daemons. See <xref
href="impala_refresh.xml#refresh"/> and <xref
href="impala_invalidate_metadata.xml#invalidate_metadata"/> for
the latest usage information for those statements. </p>
</li>
</ul>
<p conref="../shared/impala_common.xml#common/load_catalog_in_background"/>
<p conref="../shared/impala_common.xml#common/statestored_catalogd_ha_blurb"/>
<note>
<p conref="../shared/impala_common.xml#common/catalog_server_124"/>
</note>
<p>
<b>Related information:</b> <xref href="impala_config_options.xml#config_options"/>,
<xref href="impala_processes.xml#processes"/>, <xref href="impala_ports.xml#ports"/>
</p>
</conbody>
</concept>
</concept>