 <?xml version="1.0" encoding="UTF-8"?>
 <!--
 Licensed to the Apache Software Foundation (ASF) under one
 or more contributor license agreements.  See the NOTICE file
 distributed with this work for additional information
 regarding copyright ownership.  The ASF licenses this file
 to you under the Apache License, Version 2.0 (the
 "License"); you may not use this file except in compliance
 with the License.  You may obtain a copy of the License at

   http://www.apache.org/licenses/LICENSE-2.0

 Unless required by applicable law or agreed to in writing,
 software distributed under the License is distributed on an
 "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
 KIND, either express or implied.  See the License for the
 specific language governing permissions and limitations
 under the License.
 -->
 <!DOCTYPE concept PUBLIC "-//OASIS//DTD DITA Concept//EN" "concept.dtd">
 <concept id="concepts">

   <title>Impala Concepts and Architecture</title>
   <titlealts audience="PDF"><navtitle>Concepts and Architecture</navtitle></titlealts>
   <prolog>
     <metadata>
       <data name="Category" value="Impala"/>
       <data name="Category" value="Concepts"/>
       <data name="Category" value="Data Analysts"/>
       <data name="Category" value="Developers"/>
       <data name="Category" value="Stub Pages"/>
     </metadata>
   </prolog>

   <conbody>

     <p>
       The following sections provide background information to help you become productive using Impala and
       its features. Where appropriate, the explanations include context to help you understand how aspects of
       Impala relate to other technologies you might already be familiar with, such as relational database
       management systems and data warehouses, or other Hadoop components such as Hive, HDFS, and HBase.
     </p>

     <p outputclass="toc"/>
   </conbody>

 <!-- These other topics are waiting to be filled in. Could become subtopics or top-level topics depending on the depth of coverage in each case. -->

   <concept id="intro_data_lifecycle" audience="hidden">

     <title>Overview of the Data Lifecycle for Impala</title>

     <conbody/>
   </concept>

   <concept id="intro_etl" audience="hidden">

     <title>Overview of the Extract, Transform, Load (ETL) Process for Impala</title>
   <prolog>
     <metadata>
       <data name="Category" value="ETL"/>
       <data name="Category" value="Ingest"/>
       <data name="Category" value="Concepts"/>
     </metadata>
   </prolog>

     <conbody/>
   </concept>

   <concept id="intro_hadoop_data" audience="hidden">

     <title>How Impala Works with Hadoop Data Files</title>

     <conbody/>
   </concept>

   <concept id="intro_web_ui" audience="hidden">

     <title>Overview of the Impala Web Interface</title>

     <conbody/>
   </concept>

   <concept id="intro_bi" audience="hidden">

     <title>Using Impala with Business Intelligence Tools</title>

     <conbody/>
   </concept>

   <concept id="intro_ha" audience="hidden">

     <title>Overview of Impala Availability and Fault Tolerance</title>

     <conbody/>
   </concept>

 <!-- This is pretty much ready to go. Decide if it should go under "Concepts" or "Performance",
      and if it should be split out into a separate file, and then take out the audience= attribute
      to make it visible.
 -->

   <concept id="intro_llvm" audience="hidden">

     <title>Overview of Impala Runtime Code Generation</title>

     <conbody>

 <!-- Adapted from the CIDR15 paper written by the Impala team. -->

       <p>
         Impala uses <term>LLVM</term> (a compiler library and collection of related tools) to perform just-in-time
         (JIT) compilation within the running <cmdname>impalad</cmdname> process. This runtime code generation
         technique improves query execution times by generating native code optimized for the architecture of each
         host in your particular cluster. Performance gains of 5 times or more are typical for representative
         workloads.
       </p>

       <p>
         Impala uses runtime code generation to produce query-specific versions of functions that are critical to
         performance. In particular, code generation is applied to <term>inner loop</term> functions, that is, those
         that are executed many times (for every tuple) in a given query, and thus constitute a large portion of the
         total time the query takes to execute. For example, when Impala scans a data file, it calls a function to
         parse each record into Impala’s in-memory tuple format. For queries scanning large tables, billions of
         records could result in billions of function calls. This function must therefore be extremely efficient for
         good query performance, and removing even a few instructions from each function call can result in large
         query speedups.
       </p>

       <p>
         Overall, JIT compilation has an effect similar to writing custom code to process a query. For example, it
         eliminates branches, unrolls loops, propagates constants, offsets and pointers, and inlines functions.
         Inlining is especially valuable for functions used internally to evaluate expressions, where the function
         call itself is more expensive than the function body (for example, a function that adds two numbers).
         Inlining functions also increases instruction-level parallelism, and allows the compiler to make further
         optimizations such as subexpression elimination across expressions.
       </p>

       <p>
         Impala generates runtime query code automatically, so you do not need to do anything special to get this
         performance benefit. This technique is most effective for complex and long-running queries that process
         large numbers of rows. If you need to issue a series of short, small queries, you might turn off this
         feature to avoid the overhead of compilation time for each query. In this case, issue the statement
         <codeph>SET DISABLE_CODEGEN=true</codeph> to turn off runtime code generation for the duration of the
         current session.
       </p>
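
       <p>
         As a minimal sketch of the session-level switch described above, the following statements, as they
         might be entered in <cmdname>impala-shell</cmdname>, turn off code generation, run a short query, and
         then turn code generation back on. The table name <codeph>small_lookup</codeph> is hypothetical and
         stands in for any small table that you query frequently. Whether turning off the feature actually
         helps depends on your workload.
       </p>

<codeblock>-- Turn off runtime code generation for the current session.
SET DISABLE_CODEGEN=true;

-- Hypothetical short query; small_lookup is an assumed table name.
SELECT COUNT(*) FROM small_lookup;

-- Turn code generation back on before running long, complex queries.
SET DISABLE_CODEGEN=false;
</codeblock>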

 <!--
       <p>
         Without code generation,
         functions tend to be suboptimal
         to handle situations that cannot be predicted in advance.
         For example,
         a record-parsing function that
         only handles integer types will be faster at parsing an integer-only file
         than a function that handles other data types
         such as strings and floating-point numbers.
         However, the schemas of the files to
         be scanned are unknown at compile time,
         and so a general-purpose function must be used, even if at runtime
         it is known that more limited functionality is sufficient.
       </p>

       <p>
         A source of large runtime overheads are virtual functions. Virtual function calls incur a large performance
         penalty, particularly when the called function is very simple, as the calls cannot be inlined.
         If the type of the object instance is known at runtime, we can use code generation to replace the virtual
         function call with a call directly to the correct function, which can then be inlined. This is especially
         valuable when evaluating expression trees. In Impala (as in many systems), expressions are composed of a
         tree of individual operators and functions.
       </p>

       <p>
         Each type of expression that can appear in a query is implemented internally by overriding a virtual function.
         Many of these expression functions are quite simple, for example, adding two numbers.
         The virtual function call can be more expensive than the function body itself. By resolving the virtual
         function calls with code generation and then inlining the resulting function calls, Impala can evaluate expressions
         directly with no function call overhead. Inlining functions also increases
         instruction-level parallelism, and allows the compiler to make further optimizations such as subexpression
         elimination across expressions.
       </p>
 -->
     </conbody>
   </concept>

 <!-- Same as the previous section: adapted from CIDR paper, ready to externalize after deciding where to go. -->

   <concept audience="hidden" id="intro_io">

     <title>Overview of Impala I/O</title>

     <conbody>

       <p>
         Efficiently retrieving data from HDFS is a challenge for all SQL-on-Hadoop systems. To perform
         data scans from both disk and memory at or near hardware speed, Impala uses an HDFS feature called
         <term>short-circuit local reads</term> to bypass the DataNode protocol when reading from local disk. Impala
         can read at almost disk bandwidth (approximately 100 MB/s per disk) and is typically able to saturate all
         available disks. For example, with 12 disks, Impala can typically sustain I/O at about 1.2 GB/s.
         Furthermore, <term>HDFS caching</term> allows Impala to access memory-resident data at memory bus speed,
         and saves CPU cycles as there is no need to copy or checksum data blocks within memory.
       </p>

       <p>
         The I/O manager component interfaces with storage devices to read and write data. The I/O manager assigns
         a fixed number of worker threads per physical disk (currently one thread per rotational disk and eight
         per SSD) and provides an asynchronous interface to its clients (<term>scanner threads</term>).
       </p>
     </conbody>
   </concept>

 <!-- Same as the previous section: adapted from CIDR paper, ready to externalize after deciding where to go. -->

 <!-- Although good idea to get some answers from Henry first. -->

   <concept audience="hidden" id="intro_state_distribution">

     <title>State Distribution</title>

     <conbody>

       <p>
         As a massively parallel database that can run on hundreds of nodes, Impala must coordinate and synchronize
         its metadata across the entire cluster. Impala's symmetric-node architecture means that any node can accept
         and execute queries, and thus each node needs an up-to-date version of the system catalog and knowledge of
         which hosts the <cmdname>impalad</cmdname> daemons run on. To avoid the overhead of TCP connections and
         remote procedure calls to retrieve metadata during query planning, Impala implements a simple
         publish-subscribe service called the <term>statestore</term> to push metadata changes to a set of
         subscribers (the <cmdname>impalad</cmdname> daemons running on all the DataNodes).
       </p>

       <p>
         The statestore maintains a set of topics, which are arrays of <codeph>(<varname>key</varname>,
         <varname>value</varname>, <varname>version</varname>)</codeph> triplets called <term>entries</term> where
         <varname>key</varname> and <varname>value</varname> are byte arrays, and <varname>version</varname> is a
         64-bit integer. A topic is defined by an application, and so the statestore has no understanding of the
         contents of any topic entry. Topics are persistent through the lifetime of the statestore, but are not
         persisted across service restarts. Processes that receive updates to any topic are called
         <term>subscribers</term>, and express their interest by registering with the statestore at startup and
         providing a list of topics. The statestore responds to registration by sending the subscriber an initial
         topic update for each registered topic, which consists of all the entries currently in that topic.
       </p>

 <!-- Henry: OK, but in practice, what is in these topic messages for Impala? -->

       <p>
         After registration, the statestore periodically sends two kinds of messages to each subscriber. The first
         kind of message is a topic update, and consists of all changes to a topic (new entries, modified entries
         and deletions) since the last update was successfully sent to the subscriber. Each subscriber maintains a
         per-topic most-recent-version identifier, which allows the statestore to send only the delta between
         updates. In response to a topic update, each subscriber sends a list of changes it intends to make to its
         subscribed topics. Those changes are guaranteed to have been applied by the time the next update is
         received.
       </p>

       <p>
         The second kind of statestore message is a <term>heartbeat</term>, formerly sometimes called
         <term>keepalive</term>. The statestore uses heartbeat messages to maintain the connection to each
         subscriber, which would otherwise time out its subscription and attempt to re-register.
       </p>

       <p>
         Prior to Impala 2.0, both kinds of communication were combined in a single kind of message. Because these
         messages could be very large in instances with thousands of tables, partitions, data files, and so on,
         Impala 2.0 and higher separates the two kinds of messages so that the small heartbeat pings can be
         transmitted and acknowledged quickly. This separation increases the reliability of the statestore
         mechanism that detects when Impala nodes become unavailable.
       </p>

       <p>
         If the statestore detects a failed subscriber (for example, by repeated failed heartbeat deliveries), it
         stops sending updates to that node.
 <!-- Henry: what are examples of these transient topic entries? -->
         Some topic entries are marked as transient, meaning that if their owning subscriber fails, they are
         removed.
       </p>

       <p>
         Although the asynchronous nature of this mechanism means that metadata updates might take some time to
         propagate across the entire cluster, that does not affect the consistency of query planning or results.
         Each query is planned and coordinated by a particular node, so as long as the coordinator node is aware of
         the existence of the relevant tables, data files, and so on, it can distribute the query work to other
         nodes even if those other nodes have not received the latest metadata updates.
 <!-- Henry: need another example here of what's in a topic, e.g. is it the list of available tables? -->
 <!--
         For example, query planning is performed on a single node based on the
         catalog metadata topic, and once a full plan has been computed, all information required to execute that
         plan is distributed directly to the executing nodes.
         There is no requirement that an executing node should
         know about the same version of the catalog metadata topic.
 -->
       </p>

       <p>
         We have found that the statestore process with default settings scales well to medium-sized clusters, and
         can serve our largest deployments with some configuration changes.
 <!-- Henry: elaborate on the configuration changes. -->
       </p>

       <p>
 <!-- Henry: other examples like load information? How is load information used? -->
         The statestore does not persist any metadata to disk: all current metadata is pushed to the statestore by
         its subscribers (for example, load information). Therefore, if the statestore restarts, its state can be
         recovered during the initial subscriber registration phase. If the machine that the statestore is
         running on fails, a new statestore process can be started elsewhere, and subscribers can fail over to it.
         Impala has no built-in failover mechanism for the statestore. Instead, deployments commonly use a
         retargetable DNS entry so that subscribers automatically move to the new process instance.
 <!-- Henry: translate that last sentence into instructions / guidelines. -->
       </p>
     </conbody>
   </concept>
 </concept>