| <?xml version="1.0" encoding="UTF-8"?> |
| <!-- |
| Licensed to the Apache Software Foundation (ASF) under one |
| or more contributor license agreements. See the NOTICE file |
| distributed with this work for additional information |
| regarding copyright ownership. The ASF licenses this file |
| to you under the Apache License, Version 2.0 (the |
| "License"); you may not use this file except in compliance |
| with the License. You may obtain a copy of the License at |
| |
| http://www.apache.org/licenses/LICENSE-2.0 |
| |
| Unless required by applicable law or agreed to in writing, |
| software distributed under the License is distributed on an |
| "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY |
| KIND, either express or implied. See the License for the |
| specific language governing permissions and limitations |
| under the License. |
| --> |
| <!DOCTYPE concept PUBLIC "-//OASIS//DTD DITA Concept//EN" "concept.dtd"> |
| <concept id="concepts"> |
| |
| <title>Impala Concepts and Architecture</title> |
| <titlealts audience="PDF"><navtitle>Concepts and Architecture</navtitle></titlealts> |
| <prolog> |
| <metadata> |
| <data name="Category" value="Impala"/> |
| <data name="Category" value="Concepts"/> |
| <data name="Category" value="Data Analysts"/> |
| <data name="Category" value="Developers"/> |
| <data name="Category" value="Stub Pages"/> |
| </metadata> |
| </prolog> |
| |
| <conbody> |
| |
| <p> |
| The following sections provide background information to help you become productive using Impala and |
| its features. Where appropriate, the explanations include context to help you understand how aspects of Impala |
| relate to other technologies you might already be familiar with, such as relational database management |
| systems and data warehouses, or other Hadoop components such as Hive, HDFS, and HBase. |
| </p> |
| |
| <p outputclass="toc"/> |
| </conbody> |
| |
| <!-- These other topics are waiting to be filled in. Could become subtopics or top-level topics depending on the depth of coverage in each case. --> |
| |
| <concept id="intro_data_lifecycle" audience="hidden"> |
| |
| <title>Overview of the Data Lifecycle for Impala</title> |
| |
| <conbody/> |
| </concept> |
| |
| <concept id="intro_etl" audience="hidden"> |
| |
| <title>Overview of the Extract, Transform, Load (ETL) Process for Impala</title> |
| <prolog> |
| <metadata> |
| <data name="Category" value="ETL"/> |
| <data name="Category" value="Ingest"/> |
| <data name="Category" value="Concepts"/> |
| </metadata> |
| </prolog> |
| |
| <conbody/> |
| </concept> |
| |
| <concept id="intro_hadoop_data" audience="hidden"> |
| |
| <title>How Impala Works with Hadoop Data Files</title> |
| |
| <conbody/> |
| </concept> |
| |
| <concept id="intro_web_ui" audience="hidden"> |
| |
| <title>Overview of the Impala Web Interface</title> |
| |
| <conbody/> |
| </concept> |
| |
| <concept id="intro_bi" audience="hidden"> |
| |
| <title>Using Impala with Business Intelligence Tools</title> |
| |
| <conbody/> |
| </concept> |
| |
| <concept id="intro_ha" audience="hidden"> |
| |
| <title>Overview of Impala Availability and Fault Tolerance</title> |
| |
| <conbody/> |
| </concept> |
| |
| <!-- This is pretty much ready to go. Decide if it should go under "Concepts" or "Performance", |
| and if it should be split out into a separate file, and then take out the audience= attribute |
| to make it visible. |
| --> |
| |
| <concept id="intro_llvm" audience="hidden"> |
| |
| <title>Overview of Impala Runtime Code Generation</title> |
| |
| <conbody> |
| |
| <!-- Adapted from the CIDR15 paper written by the Impala team. --> |
| |
| <p> |
| Impala uses <term>LLVM</term> (a compiler library and collection of related tools) to perform just-in-time |
| (JIT) compilation within the running <cmdname>impalad</cmdname> process. This runtime code generation |
| technique improves query execution times by generating native code optimized for the architecture of each |
| host in your particular cluster. Performance gains of 5 times or more are typical for representative |
| workloads. |
| </p> |
| |
| <p> |
| Impala uses runtime code generation to produce query-specific versions of functions that are critical to |
| performance. In particular, code generation is applied to <term>inner loop</term> functions, that is, those |
| that are executed many times (for every tuple) in a given query, and thus constitute a large portion of the |
| total time the query takes to execute. For example, when Impala scans a data file, it calls a function to |
| parse each record into Impala’s in-memory tuple format. For queries scanning large tables, billions of |
| records could result in billions of function calls. This function must therefore be extremely efficient for |
| good query performance, and removing even a few instructions from each function call can result in large |
| query speedups. |
| </p> |
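| |
| <p> |
| As a rough analogy only (Impala generates native code through LLVM, not Python closures), the following |
| sketch contrasts a general-purpose record parser, which looks up the column type for every field of every |
| record, with a parser that is specialized once for a known schema. The names |
| <codeph>parse_generic</codeph> and <codeph>make_specialized_parser</codeph>, the type names, and the |
| comma-delimited record format are hypothetical and chosen only for illustration. |
| </p> |
| <codeblock> |
```python
from typing import Callable, List, Sequence

# Hypothetical column type names; Impala's real type system is much richer.
CONVERTERS = {"int": int, "float": float, "string": str}


def parse_generic(record: str, schema: Sequence[str]) -> List[object]:
    """General-purpose parser: looks up the converter for every field of every record."""
    fields = record.split(",")
    return [CONVERTERS[col_type](field) for col_type, field in zip(schema, fields)]


def make_specialized_parser(schema: Sequence[str]) -> Callable[[str], List[object]]:
    """Resolve the per-column converters once, then reuse them for every record."""
    converters = tuple(CONVERTERS[col_type] for col_type in schema)

    def parse(record: str) -> List[object]:
        return [convert(field) for convert, field in zip(converters, record.split(","))]

    return parse


schema = ("int", "int", "string")
parse_row = make_specialized_parser(schema)  # done once per query
rows = [parse_row(line) for line in ("1,2,a", "3,4,b")]
```
| </codeblock> |
| <p> |
| The specialized parser moves the type lookup out of the per-record path, which is the same kind of saving, |
| on a much smaller scale, that query-specific native code aims for. |
| </p> |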
| |
| <p> |
| Overall, JIT compilation has an effect similar to writing custom code to process a query. For example, it |
| eliminates branches, unrolls loops, propagates constants, offsets and pointers, and inlines functions. |
| Inlining is especially valuable for functions used internally to evaluate expressions, where the function |
| call itself is more expensive than the function body (for example, a function that adds two numbers). |
| Inlining functions also increases instruction-level parallelism, and allows the compiler to make further |
| optimizations such as subexpression elimination across expressions. |
| </p> |
| |
| <p> |
| Impala generates runtime query code automatically, so you do not need to do anything special to get this |
| performance benefit. This technique is most effective for complex and long-running queries that process |
| large numbers of rows. If you need to issue a series of short, small queries, you might turn off this |
| feature to avoid the overhead of compilation time for each query. In this case, issue the statement |
| <codeph>SET DISABLE_CODEGEN=true</codeph> to turn off runtime code generation for the duration of the |
| current session. |
| </p> |
| |
| <!-- |
| <p> |
| Without code generation, |
| functions tend to be suboptimal |
| to handle situations that cannot be predicted in advance. |
| For example, |
| a record-parsing function that |
| only handles integer types will be faster at parsing an integer-only file |
| than a function that handles other data types |
| such as strings and floating-point numbers. |
| However, the schemas of the files to |
| be scanned are unknown at compile time, |
| and so a general-purpose function must be used, even if at runtime |
| it is known that more limited functionality is sufficient. |
| </p> |
| |
| <p> |
| A source of large runtime overheads are virtual functions. Virtual function calls incur a large performance |
| penalty, particularly when the called function is very simple, as the calls cannot be inlined. |
| If the type of the object instance is known at runtime, we can use code generation to replace the virtual |
| function call with a call directly to the correct function, which can then be inlined. This is especially |
| valuable when evaluating expression trees. In Impala (as in many systems), expressions are composed of a |
| tree of individual operators and functions. |
| </p> |
| |
| <p> |
| Each type of expression that can appear in a query is implemented internally by overriding a virtual function. |
| Many of these expression functions are quite simple, for example, adding two numbers. |
| The virtual function call can be more expensive than the function body itself. By resolving the virtual |
| function calls with code generation and then inlining the resulting function calls, Impala can evaluate expressions |
| directly with no function call overhead. Inlining functions also increases |
| instruction-level parallelism, and allows the compiler to make further optimizations such as subexpression |
| elimination across expressions. |
| </p> |
| --> |
| </conbody> |
| </concept> |
| |
| <!-- Same as the previous section: adapted from CIDR paper, ready to externalize after deciding where to go. --> |
| |
| <concept audience="hidden" id="intro_io"> |
| |
| <title>Overview of Impala I/O</title> |
| |
| <conbody> |
| |
| <p> |
| Efficiently retrieving data from HDFS is a challenge for all SQL-on-Hadoop systems. To perform |
| data scans from both disk and memory at or near hardware speed, Impala uses an HDFS feature called |
| <term>short-circuit local reads</term> to bypass the DataNode protocol when reading from local disk. Impala |
| can read at almost disk bandwidth (approximately 100 MB/s per disk) and is typically able to saturate all |
| available disks. For example, with 12 disks, Impala is typically capable of sustaining I/O at 1.2 GB/s. |
| Furthermore, <term>HDFS caching</term> allows Impala to access memory-resident data at memory bus speed, |
| and saves CPU cycles as there is no need to copy or checksum data blocks within memory. |
| </p> |
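| |
| <p> |
| The bandwidth arithmetic above can be written out as a small calculation. This is an idealized sketch that |
| assumes every disk is fully saturated and treats 1 GB as 1000 MB. The function names and the 120 GB table |
| size are hypothetical. |
| </p> |
| <codeblock> |
```python
def aggregate_scan_bandwidth_mb_per_s(disks: int, per_disk_mb_per_s: float = 100.0) -> float:
    """Aggregate read bandwidth if every disk is saturated (per-disk figure from the text)."""
    return disks * per_disk_mb_per_s


def scan_seconds(table_size_gb: float, disks: int) -> float:
    """Idealized time to scan a table on one host; treats 1 GB as 1000 MB."""
    return table_size_gb * 1000.0 / aggregate_scan_bandwidth_mb_per_s(disks)


bandwidth = aggregate_scan_bandwidth_mb_per_s(12)  # 1200.0 MB/s, that is, 1.2 GB/s
seconds = scan_seconds(120.0, 12)                  # 100.0 seconds for a hypothetical 120 GB table
```
| </codeblock> |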
| |
| <p> |
| The I/O manager component interfaces with storage devices to read and write data. The I/O manager assigns a |
| fixed number of worker threads per physical disk (currently one thread per rotational disk and eight per |
| SSD), providing an asynchronous interface to clients (<term>scanner threads</term>). |
| </p> |
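| |
| <p> |
| The thread assignment described above can be sketched with one fixed-size worker pool per disk and a |
| non-blocking read call that returns a future. This is an illustration in Python, not Impala's actual |
| implementation. The class <codeph>IoManagerSketch</codeph>, its method names, and the disk names are |
| hypothetical. Only the per-disk thread counts come from the text. |
| </p> |
| <codeblock> |
```python
import os
import tempfile
from concurrent.futures import Future, ThreadPoolExecutor
from typing import Dict

THREADS_PER_DISK = {"rotational": 1, "ssd": 8}  # counts stated in the text


class IoManagerSketch:
    """One fixed-size worker pool per disk; callers get a Future back immediately."""

    def __init__(self, disks: Dict[str, str]) -> None:
        # disks maps a hypothetical disk name to its kind ("rotational" or "ssd").
        self._pools = {
            name: ThreadPoolExecutor(max_workers=THREADS_PER_DISK[kind])
            for name, kind in disks.items()
        }

    def read_async(self, disk: str, path: str, offset: int, length: int) -> Future:
        """Queue a read on the pool that serves the given disk."""
        return self._pools[disk].submit(self._read, path, offset, length)

    @staticmethod
    def _read(path: str, offset: int, length: int) -> bytes:
        with open(path, "rb") as f:
            f.seek(offset)
            return f.read(length)

    def shutdown(self) -> None:
        for pool in self._pools.values():
            pool.shutdown()


# Usage: a scanner thread queues a read and blocks only when it needs the bytes.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"0123456789")
manager = IoManagerSketch({"disk0": "rotational", "disk1": "ssd"})
pending = manager.read_async("disk0", tmp.name, 2, 4)
data = pending.result()
manager.shutdown()
os.remove(tmp.name)
```
| </codeblock> |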
| </conbody> |
| </concept> |
| |
| <!-- Same as the previous section: adapted from CIDR paper, ready to externalize after deciding where to go. --> |
| |
| <!-- Although good idea to get some answers from Henry first. --> |
| |
| <concept audience="hidden" id="intro_state_distribution"> |
| |
| <title>State Distribution</title> |
| |
| <conbody> |
| |
| <p> |
| As a massively parallel database that can run on hundreds of nodes, Impala must coordinate and synchronize |
| its metadata across the entire cluster. Impala's symmetric-node architecture means that any node can accept |
| and execute queries, and thus each node needs an up-to-date version of the system catalog and knowledge of |
| which hosts the <cmdname>impalad</cmdname> daemons run on. To avoid the overhead of TCP connections and |
| remote procedure calls to retrieve metadata during query planning, Impala implements a simple |
| publish-subscribe service called the <term>statestore</term> to push metadata changes to a set of |
| subscribers (the <cmdname>impalad</cmdname> daemons running on all the DataNodes). |
| </p> |
| |
| <p> |
| The statestore maintains a set of topics, which are arrays of <codeph>(<varname>key</varname>, |
| <varname>value</varname>, <varname>version</varname>)</codeph> triplets called <term>entries</term> where |
| <varname>key</varname> and <varname>value</varname> are byte arrays, and <varname>version</varname> is a |
| 64-bit integer. A topic is defined by an application, and so the statestore has no understanding of the |
| contents of any topic entry. Topics persist for the lifetime of the statestore process, but are not |
| preserved across service restarts. Processes that receive updates to any topic are called |
| <term>subscribers</term>, and express their interest by registering with the statestore at startup and |
| providing a list of topics. The statestore responds to registration by sending the subscriber an initial |
| topic update for each registered topic, which consists of all the entries currently in that topic. |
| </p> |
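| |
| <p> |
| The topic and registration model described above can be sketched as follows. This is a minimal |
| illustration, not the statestore's real interface. The names <codeph>Entry</codeph>, |
| <codeph>StatestoreSketch</codeph>, <codeph>put</codeph>, and <codeph>register</codeph>, and the topic and |
| subscriber names, are hypothetical. |
| </p> |
| <codeblock> |
```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Entry:
    key: bytes      # opaque to the statestore
    value: bytes    # opaque to the statestore
    version: int    # 64-bit in the text; Python ints are unbounded


class StatestoreSketch:
    def __init__(self) -> None:
        # Held in memory only, so the contents are lost when the process restarts.
        self._topics: Dict[str, Dict[bytes, Entry]] = {}
        self._subscribers: Dict[str, List[str]] = {}

    def put(self, topic: str, key: bytes, value: bytes, version: int) -> None:
        self._topics.setdefault(topic, {})[key] = Entry(key, value, version)

    def register(self, subscriber_id: str, topics: List[str]) -> Dict[str, List[Entry]]:
        """Record the subscription and return the initial full update for each topic."""
        self._subscribers[subscriber_id] = list(topics)
        return {t: list(self._topics.get(t, {}).values()) for t in topics}


store = StatestoreSketch()
store.put("topic-a", b"k1", b"opaque-bytes", 1)
initial = store.register("subscriber-1", ["topic-a", "topic-b"])
```
| </codeblock> |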
| |
| <!-- Henry: OK, but in practice, what is in these topic messages for Impala? --> |
| |
| <p> |
| After registration, the statestore periodically sends two kinds of messages to each subscriber. The first |
| kind of message is a topic update, and consists of all changes to a topic (new entries, modified entries |
| and deletions) since the last update was successfully sent to the subscriber. Each subscriber maintains a |
| per-topic most-recent-version identifier, which allows the statestore to send only the delta between |
| updates. In response to a topic update, each subscriber sends a list of changes it intends to make to its |
| subscribed topics. Those changes are guaranteed to have been applied by the time the next update is |
| received. |
| </p> |
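| |
| <p> |
| The delta mechanism described above can be sketched as follows, assuming that a topic is modeled as a |
| versioned change log and that a deletion is represented by an entry whose value is |
| <codeph>None</codeph>. Both are assumptions made for this illustration, and the names |
| <codeph>topic_delta</codeph> and <codeph>apply_delta</codeph> are hypothetical. |
| </p> |
| <codeblock> |
```python
from typing import Dict, List, NamedTuple, Optional


class Entry(NamedTuple):
    key: bytes
    value: Optional[bytes]  # None marks a deletion in this sketch
    version: int


def topic_delta(entries: List[Entry], last_seen_version: int) -> List[Entry]:
    """Return only the entries changed after the version the subscriber last processed."""
    return [e for e in entries if e.version > last_seen_version]


def apply_delta(local: Dict[bytes, bytes], delta: List[Entry], last_seen_version: int) -> int:
    """Apply a delta to a subscriber's local copy and return its new most-recent version."""
    for e in delta:
        if e.value is None:
            local.pop(e.key, None)
        else:
            local[e.key] = e.value
        last_seen_version = max(last_seen_version, e.version)
    return last_seen_version


topic = [Entry(b"a", b"1", 1), Entry(b"b", b"2", 2), Entry(b"a", None, 3)]
local_copy = {b"a": b"1"}      # the subscriber has already processed version 1
delta = topic_delta(topic, 1)  # the changes at versions 2 and 3
new_version = apply_delta(local_copy, delta, 1)
```
| </codeblock> |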
| |
| <p> |
| The second kind of statestore message is a <term>heartbeat</term>, formerly sometimes called |
| <term>keepalive</term>. The statestore uses heartbeat messages to maintain the connection to each |
| subscriber, which would otherwise time out its subscription and attempt to re-register. |
| </p> |
| |
| <p> |
| Prior to Impala 2.0, both kinds of communication were combined in a single kind of message. Because these |
| messages could be very large in instances with thousands of tables, partitions, data files, and so on, |
| Impala 2.0 and higher divides the types of messages so that the small heartbeat pings can be transmitted |
| and acknowledged quickly, increasing the reliability of the statestore mechanism that detects when Impala |
| nodes become unavailable. |
| </p> |
| |
| <p> |
| If the statestore detects a failed subscriber (for example, by repeated failed heartbeat deliveries), it |
| stops sending updates to that node. |
| <!-- Henry: what are examples of these transient topic entries? --> |
| Some topic entries are marked as transient, meaning that if their owning subscriber fails, they are |
| removed. |
| </p> |
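| |
| <p> |
| The failure handling described above can be sketched as follows. The threshold of three consecutive failed |
| heartbeats is a hypothetical value, because the text does not state one. The class |
| <codeph>FailureDetectorSketch</codeph>, its fields, and the entry and node names are likewise assumptions |
| made for this illustration. |
| </p> |
| <codeblock> |
```python
from typing import Dict, Set, Tuple

MAX_FAILED_HEARTBEATS = 3  # hypothetical threshold; the text gives no number


class FailureDetectorSketch:
    def __init__(self) -> None:
        self.failed_heartbeats: Dict[str, int] = {}
        self.failed: Set[str] = set()
        # key -> (owning subscriber, is_transient)
        self.entries: Dict[bytes, Tuple[str, bool]] = {}

    def record_heartbeat(self, subscriber: str, delivered: bool) -> None:
        """Track consecutive failed deliveries; a success resets the count."""
        if delivered:
            self.failed_heartbeats[subscriber] = 0
            return
        count = self.failed_heartbeats.get(subscriber, 0) + 1
        self.failed_heartbeats[subscriber] = count
        if count >= MAX_FAILED_HEARTBEATS:
            self._mark_failed(subscriber)

    def _mark_failed(self, subscriber: str) -> None:
        self.failed.add(subscriber)  # no further updates are sent to this node
        # Drop only the transient entries owned by the failed subscriber.
        self.entries = {
            key: (owner, transient)
            for key, (owner, transient) in self.entries.items()
            if not (transient and owner == subscriber)
        }


detector = FailureDetectorSketch()
detector.entries = {b"membership": ("node-1", True), b"durable": ("node-1", False)}
for _ in range(MAX_FAILED_HEARTBEATS):
    detector.record_heartbeat("node-1", delivered=False)
```
| </codeblock> |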
| |
| <p> |
| Although the asynchronous nature of this mechanism means that metadata updates might take some time to |
| propagate across the entire cluster, that does not affect the consistency of query planning or results. |
| Each query is planned and coordinated by a particular node, so as long as the coordinator node is aware of |
| the existence of the relevant tables, data files, and so on, it can distribute the query work to other |
| nodes even if those other nodes have not received the latest metadata updates. |
| <!-- Henry: need another example here of what's in a topic, e.g. is it the list of available tables? --> |
| <!-- |
| For example, query planning is performed on a single node based on the |
| catalog metadata topic, and once a full plan has been computed, all information required to execute that |
| plan is distributed directly to the executing nodes. |
| There is no requirement that an executing node should |
| know about the same version of the catalog metadata topic. |
| --> |
| </p> |
| |
| <p> |
| We have found that the statestore process with default settings scales well to medium-sized clusters, and |
| can serve our largest deployments with some configuration changes. |
| <!-- Henry: elaborate on the configuration changes. --> |
| </p> |
| |
| <p> |
| <!-- Henry: other examples like load information? How is load information used? --> |
| The statestore does not persist any metadata to disk: all current metadata is pushed to the statestore by |
| its subscribers (for example, load information). Therefore, if the statestore restarts, its state can be |
| recovered during the initial subscriber registration phase. If the machine that the statestore is |
| running on fails, a new statestore process can be started elsewhere, and subscribers can fail over to it. |
| There is no built-in failover mechanism in Impala; instead, deployments commonly use a retargetable DNS |
| entry to force subscribers to automatically move to the new process instance. |
| <!-- Henry: translate that last sentence into instructions / guidelines. --> |
| </p> |
| </conbody> |
| </concept> |
| </concept> |