blob: 8f5f7f3837f444e8f8f2cc9c82159067ef6e1db6 [file] [log] [blame]
<?xml version="1.0" encoding="UTF-8"?>
Licensed to the Apache Software Foundation (ASF) under one
or more contributor license agreements. See the NOTICE file
distributed with this work for additional information
regarding copyright ownership. The ASF licenses this file
to you under the Apache License, Version 2.0 (the
"License"); you may not use this file except in compliance
with the License. You may obtain a copy of the License at
Unless required by applicable law or agreed to in writing,
software distributed under the License is distributed on an
KIND, either express or implied. See the License for the
specific language governing permissions and limitations
under the License.
<!DOCTYPE concept PUBLIC "-//OASIS//DTD DITA Concept//EN" "concept.dtd">
<concept id="intro_components">
<title>Components of the Impala Server</title>
<titlealts audience="PDF"><navtitle>Components</navtitle></titlealts>
<data name="Category" value="Impala"/>
<data name="Category" value="Concepts"/>
<data name="Category" value="Administrators"/>
<data name="Category" value="Developers"/>
<data name="Category" value="Data Analysts"/>
<p> The Impala server is a distributed, massively parallel processing (MPP)
database engine. It consists of different daemon processes that run on
specific hosts within your cluster. </p>
<p outputclass="toc inpage"/>
<concept id="intro_impalad">
<title>The Impala Daemon</title>
<p> The core Impala component is the Impala daemon, physically represented
by the <codeph>impalad</codeph> process. A few of the key functions that
an Impala daemon performs are:<ul>
<li>Reads and writes to data files.</li>
<li>Accepts queries transmitted from the <codeph>impala-shell</codeph>
command, Hue, JDBC, or ODBC.</li>
<li>Parallelizes the queries and distributes work across the
<li>Transmits intermediate query results back to the central
coordinator. </li>
<p>Impala daemons can be deployed in one of the following ways:<ul>
<li>HDFS and Impala are co-located, and each Impala daemon runs on the
same host as a DataNode.</li>
<li>Impala is deployed separately in a compute cluster and reads
remotely from HDFS, S3, ADLS, etc.</li>
<p> The Impala daemons are in constant communication with StateStore, to
confirm which daemons are healthy and can accept new work. </p>
<p rev="1.2"> They also receive broadcast messages from the
<cmdname>catalogd</cmdname> daemon (introduced in Impala 1.2) whenever
any Impala daemon in the cluster creates, alters, or drops any type of
object, or when an <codeph>INSERT</codeph> or <codeph>LOAD DATA</codeph>
statement is processed through Impala. This background communication
minimizes the need for <codeph>REFRESH</codeph> or <codeph>INVALIDATE
METADATA</codeph> statements that were needed to coordinate metadata
across Impala daemons prior to Impala 1.2. </p>
<p rev="2.9.0 IMPALA-3807 IMPALA-5147 IMPALA-5503">
In <keyword keyref="impala29_full"/> and higher, you can control which hosts act as query coordinators
and which act as query executors, to improve scalability for highly concurrent workloads on large clusters.
See <xref keyref="scalability_coordinator"/> for details.
<b>Related information:</b> <xref href="impala_config_options.xml#config_options"/>,
<xref href="impala_processes.xml#processes"/>, <xref href="impala_timeouts.xml#impalad_timeout"/>,
<xref href="impala_ports.xml#ports"/>, <xref href="impala_proxy.xml#proxy"/>
<concept id="intro_statestore">
<title>The Impala Statestore</title>
<p> The Impala component known as the StateStore checks on the health of
all Impala daemons in a cluster, and continuously relays its findings to
each of those daemons. It is physically represented by a daemon process
named <codeph>statestored</codeph>. You only need such a process on one
host in a cluster. If an Impala daemon goes offline due to hardware
failure, network error, software issue, or other reason, the StateStore
informs all the other Impala daemons so that future queries can avoid
making requests to the unreachable Impala daemon. </p>
<p> Because the StateStore's purpose is to help when things go wrong and
to broadcast metadata to coordinators, it is not always critical to the
normal operation of an Impala cluster. If the StateStore is not running
or becomes unreachable, the Impala daemons continue running and
distributing work among themselves as usual when working with the data
known to Impala. The cluster just becomes less robust if other Impala
daemons fail, and metadata becomes less consistent as it changes while
the StateStore is offline. When the StateStore comes back online, it
re-establishes communication with the Impala daemons and resumes its
monitoring and broadcasting functions. </p>
<p> If you issue a DDL statement while the StateStore is down, the queries
that access the new object the DDL created will fail. </p>
<p conref="../shared/impala_common.xml#common/statestored_catalogd_ha_blurb"/>
<b>Related information:</b>
<xref href="impala_scalability.xml#statestore_scalability"/>,
<xref href="impala_config_options.xml#config_options"/>, <xref href="impala_processes.xml#processes"/>,
<xref href="impala_timeouts.xml#statestore_timeout"/>, <xref href="impala_ports.xml#ports"/>
<concept rev="1.2" id="intro_catalogd">
<title>The Impala Catalog Service</title>
<p> The Impala component known as the Catalog Service relays the metadata
changes from Impala SQL statements to all the Impala daemons in a
cluster. It is physically represented by a daemon process named
<codeph>catalogd</codeph>. You only need such a process on one host in
a cluster. Because the requests are passed through the StateStore
daemon, it makes sense to run the <cmdname>statestored</cmdname> and
<cmdname>catalogd</cmdname> services on the same host. </p>
<p> The catalog service avoids the need to issue <codeph>REFRESH</codeph>
and <codeph>INVALIDATE METADATA</codeph> statements when the metadata
changes are performed by statements issued through Impala. When you
create a table, load data, and so on through Hive, you do need to issue
<codeph>REFRESH</codeph> or <codeph>INVALIDATE METADATA</codeph> on an
Impala daemon before executing a query there. </p>
This feature touches a number of aspects of Impala:
<ul id="catalogd_xrefs">
See <xref href="impala_install.xml#install"/>, <xref href="impala_upgrading.xml#upgrading"/> and
<xref href="impala_processes.xml#processes"/>, for usage information for the
<cmdname>catalogd</cmdname> daemon.
<p> The <codeph>REFRESH</codeph> and <codeph>INVALIDATE
METADATA</codeph> statements are not needed when the
<codeph>CREATE TABLE</codeph>, <codeph>INSERT</codeph>, or other
table-changing or data-changing operation is performed through
Impala. These statements are still needed if such operations are
done through Hive or by manipulating data files directly in HDFS,
but in those cases the statements only need to be issued on one
Impala daemon rather than on all daemons. See <xref
href="impala_refresh.xml#refresh"/> and <xref
href="impala_invalidate_metadata.xml#invalidate_metadata"/> for
the latest usage information for those statements. </p>
<p conref="../shared/impala_common.xml#common/load_catalog_in_background"/>
<p conref="../shared/impala_common.xml#common/statestored_catalogd_ha_blurb"/>
<p conref="../shared/impala_common.xml#common/catalog_server_124"/>
<b>Related information:</b> <xref href="impala_config_options.xml#config_options"/>,
<xref href="impala_processes.xml#processes"/>, <xref href="impala_ports.xml#ports"/>