
 <?xml version="1.0" encoding="UTF-8"?><!--
 Licensed to the Apache Software Foundation (ASF) under one
 or more contributor license agreements.  See the NOTICE file
 distributed with this work for additional information
 regarding copyright ownership.  The ASF licenses this file
 to you under the Apache License, Version 2.0 (the
 "License"); you may not use this file except in compliance
 with the License.  You may obtain a copy of the License at

   http://www.apache.org/licenses/LICENSE-2.0

 Unless required by applicable law or agreed to in writing,
 software distributed under the License is distributed on an
 "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
 KIND, either express or implied.  See the License for the
 specific language governing permissions and limitations
 under the License.
 -->
 <!DOCTYPE concept PUBLIC "-//OASIS//DTD DITA Concept//EN" "concept.dtd">
 <concept id="scalability">

   <title>Scalability Considerations for Impala</title>
   <titlealts audience="PDF"><navtitle>Scalability Considerations</navtitle></titlealts>
   <prolog>
     <metadata>
       <data name="Category" value="Performance"/>
       <data name="Category" value="Impala"/>
       <data name="Category" value="Planning"/>
       <data name="Category" value="Querying"/>
       <data name="Category" value="Developers"/>
       <data name="Category" value="Memory"/>
       <data name="Category" value="Scalability"/>
       <!-- Using domain knowledge about Impala, sizing, etc. to decide what to mark as 'Proof of Concept'. -->
       <data name="Category" value="Proof of Concept"/>
     </metadata>
   </prolog>

   <conbody>

     <p>
       This section explains how the size of your cluster and the volume of data influence SQL performance and
       schema design for Impala tables. Typically, adding more cluster capacity reduces problems due to memory
       limits or disk throughput. On the other hand, larger clusters are more likely to have other kinds of
       scalability issues, such as a single slow node that causes performance problems for queries.
     </p>

     <p outputclass="toc inpage"/>

     <p conref="../shared/impala_common.xml#common/cookbook_blurb"/>

   </conbody>

   <concept audience="hidden" id="scalability_memory">

     <title>Overview and Guidelines for Impala Memory Usage</title>
   <prolog>
     <metadata>
       <data name="Category" value="Memory"/>
       <data name="Category" value="Concepts"/>
       <data name="Category" value="Best Practices"/>
       <data name="Category" value="Guidelines"/>
     </metadata>
   </prolog>

     <conbody>

 <!--
 Outline adapted from Alan Choi's "best practices" and/or "performance cookbook" papers.
 -->

 <codeblock>Memory Usage – the Basics
 *  Memory is used by:
 *  Hash join – RHS tables after decompression, filtering and projection
 *  Group by – proportional to the #groups
 *  Parquet writer buffer – 1GB per partition
 *  IO buffer (shared across queries)
 *  Metadata cache (no more than 1GB typically)
 *  Memory held and reused by later query
 *  Impala releases memory from time to time starting in 1.4.

 Memory Usage – Estimating Memory Usage
 *  Use Explain Plan
 * Requires statistics! Mem estimate without stats is meaningless.
 * Reports per-host memory requirement for this cluster size.
 *  Re-run if you’ve re-sized the cluster!
 [image of explain plan]

 Memory Usage – Estimating Memory Usage
 *  EXPLAIN’s memory estimate issues
 *  Can be way off – much higher or much lower.
 *  group by’s estimate can be particularly off – when there’s a large number of group by columns.
 *  Mem estimate = NDV of group by column 1 * NDV of group by column 2 * ... NDV of group by column n
 *  Ignore EXPLAIN’s estimate if it’s too high!
 *  Do your own estimate for group by
 *  GROUP BY mem usage = (total number of groups * size of each row) + (total number of groups * size of each row) / num node

 Memory Usage – Finding Actual Memory Usage
 *  Search for “Per Node Peak Memory Usage” in the profile.
 This is accurate. Use it for production capacity planning.

 Memory Usage – Actual Memory Usage
 *  For complex queries, how do I know which part of my query is using too much memory?
 *  Use the ExecSummary from the query profile!
 - But is that "Peak Mem" number aggregate or per-node?
 [image of executive summary]

 Memory Usage – Hitting Mem-limit
 *  Top causes (in order) of hitting mem-limit even when running a single query:
 1. Lack of statistics
 2. Lots of joins within a single query
 3. Big-table joining big-table
 4. Gigantic group by

 Memory Usage – Hitting Mem-limit
 Lack of stats
 *  Wrong join order, wrong join strategy, wrong insert strategy
 *  Explain Plan tells you that!
 [image of explain plan]
 *  Fix: Compute Stats table

 Memory Usage – Hitting Mem-limit
 Lots of joins within a single query
 * select...from fact, dim1, dim2,dim3,...dimN where ...
 * Each dim tbl can fit in memory, but not all of them together
 * As of Impala 1.4, Impala might choose the wrong plan – BROADCAST
 FIX 1: use shuffle hint
 select ... from fact join [shuffle] dim1 on ... join dim2 [shuffle] ...
 FIX 2: pre-join the dim tables (if possible)
 - How about an example to illustrate that technique?
 * fewer joins =&gt; better perf!

 Memory Usage: Hitting Mem-limit
 Big-table joining big-table
 *  Big-table (after decompression, filtering, and projection) is a table that is bigger than total cluster memory size.
 *  Impala 2.0 will do this (via disk-based join). Consider using Hive for now.
 *  (Advanced) For a simple query, you can try this advanced workaround – per-partition join
 *  Requires the partition key be part of the join key
 select ... from BigTbl_A a join BigTbl_B b where a.part_key = b.part_key and a.part_key in (1,2,3)
    union all
 select ... from BigTbl_A a join BigTbl_B b where a.part_key = b.part_key and a.part_key in (4,5,6)

 Memory Usage: Hitting Mem-limit
 Gigantic group by
 * The total number of distinct groups is huge, such as group by userid.
 * Impala 2.0 will do this (via disk-based agg). Consider using Hive for now.
 - Is this one of the cases where people were unhappy we recommended Hive?
 * (Advanced) For a simple query, you can try this advanced workaround – per-partition agg
 *  Requires the partition key be part of the group by
 select part_key, col1, col2, ...agg(..) from tbl where
        part_key in (1,2,3)
        Union all
        Select part_key, col1, col2, ...agg(..) from tbl where
        part_key in (4,5,6)
 - But where's the GROUP BY in the preceding query? Need a real example.

 Memory Usage: Additional Notes
 *  Use explain plan for estimate; use profile for accurate measure
 *  Data skew can cause uneven memory usage
 *  Review previous common issues on out-of-memory
 *  Note: Even with disk-based joins, you'll want to review these steps to speed up queries and use memory more efficiently
 </codeblock>
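
     <p>
       As a minimal sketch of the per-partition aggregation workaround outlined above, the following query
       spells out the <codeph>GROUP BY</codeph> clause in each branch. The table <codeph>tbl</codeph>, its
       columns, and the partition key values are hypothetical placeholders. Because
       <codeph>part_key</codeph> is one of the grouping columns, no group spans both branches, so the two
       result sets can be combined with <codeph>UNION ALL</codeph>.
     </p>

 <codeblock>-- Hypothetical table and columns, for illustration only.
 -- Each branch aggregates a disjoint set of partitions.
 select part_key, col1, count(*) as cnt, sum(col2) as total
   from tbl
   where part_key in (1,2,3)
   group by part_key, col1
 union all
 select part_key, col1, count(*) as cnt, sum(col2) as total
   from tbl
   where part_key in (4,5,6)
   group by part_key, col1;
 </codeblock>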
     </conbody>
   </concept>

   <concept id="scalability_catalog">

     <title>Impact of Many Tables or Partitions on Impala Catalog Performance and Memory Usage</title>

     <conbody>

       <p audience="hidden">
         Details to fill in in future: Impact of <q>load catalog in background</q> option.
         Changing timeouts.
       </p>

       <p>
         Because Hadoop I/O is optimized for reading and writing large files, Impala is optimized for tables
         containing relatively few, large data files. Schemas containing thousands of tables, or tables containing
         thousands of partitions, can encounter performance issues during startup or during DDL operations such as
         <codeph>ALTER TABLE</codeph> statements.
       </p>

       <note type="important" rev="TSB-168">
       <p>
         Because of a change in the default heap size for the <cmdname>catalogd</cmdname> daemon in
         <keyword keyref="impala25_full"/> and higher, the following procedure to increase the <cmdname>catalogd</cmdname>
         memory limit might be required following an upgrade to <keyword keyref="impala25_full"/> even if not
         needed previously.
       </p>
       </note>

       <p conref="../shared/impala_common.xml#common/increase_catalogd_heap_size"/>

     </conbody>
   </concept>

   <concept rev="2.1.0" id="statestore_scalability">

     <title>Scalability Considerations for the Impala Statestore</title>

     <conbody>

       <p>
         Before <keyword keyref="impala21_full"/>, the statestore sent only one kind of message to its subscribers. This message contained all
         updates for any topics that a subscriber had subscribed to. It also served to let subscribers know that the
         statestore had not failed, and conversely the statestore used the success of sending a heartbeat to a
         subscriber to decide whether or not the subscriber had failed.
       </p>

       <p>
         Combining topic updates and failure detection in a single message led to bottlenecks in clusters with large
         numbers of tables, partitions, and HDFS data blocks. When the statestore was overloaded with metadata
         updates to transmit, heartbeat messages were sent less frequently, sometimes causing subscribers to time
         out their connection with the statestore. Increasing the subscriber timeout and decreasing the frequency of
         statestore heartbeats worked around the problem, but reduced responsiveness when the statestore failed or
         restarted.
       </p>

       <p>
         As of <keyword keyref="impala21_full"/>, the statestore now sends topic updates and heartbeats in separate messages. This allows the
         statestore to send and receive a steady stream of lightweight heartbeats, and removes the requirement to
         send topic updates according to a fixed schedule, reducing statestore network overhead.
       </p>

       <p>
         The statestore now has the following relevant configuration flags for the <cmdname>statestored</cmdname>
         daemon:
       </p>

       <dl>
         <dlentry id="statestore_num_update_threads">

           <dt>
             <codeph>-statestore_num_update_threads</codeph>
           </dt>

           <dd>
             The number of threads inside the statestore dedicated to sending topic updates. You should not
             typically need to change this value.
             <p>
               <b>Default:</b> 10
             </p>
           </dd>

         </dlentry>

         <dlentry id="statestore_update_frequency_ms">

           <dt>
             <codeph>-statestore_update_frequency_ms</codeph>
           </dt>

           <dd>
             The frequency, in milliseconds, with which the statestore tries to send topic updates to each
             subscriber. This is a best-effort value; if the statestore is unable to meet this frequency, it sends
             topic updates as fast as it can. You should not typically need to change this value.
             <p>
               <b>Default:</b> 2000
             </p>
           </dd>

         </dlentry>

         <dlentry id="statestore_num_heartbeat_threads">

           <dt>
             <codeph>-statestore_num_heartbeat_threads</codeph>
           </dt>

           <dd>
             The number of threads inside the statestore dedicated to sending heartbeats. You should not typically
             need to change this value.
             <p>
               <b>Default:</b> 10
             </p>
           </dd>

         </dlentry>

         <dlentry id="statestore_heartbeat_frequency_ms">

           <dt>
             <codeph>-statestore_heartbeat_frequency_ms</codeph>
           </dt>

           <dd>
             The frequency, in milliseconds, with which the statestore tries to send heartbeats to each subscriber.
             This value should be good for large catalogs and clusters up to approximately 150 nodes. Beyond that,
             you might need to increase this value to make the interval longer between heartbeat messages.
             <p>
               <b>Default:</b> 1000 (one heartbeat message every second)
             </p>
           </dd>

         </dlentry>
       </dl>
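
       <p>
         As an illustration of the heartbeat setting, a cluster substantially larger than 150 nodes might
         start the statestore with a longer heartbeat interval. The value of 2000 milliseconds is an
         assumption chosen for the example, not a tested recommendation, and the way startup flags are
         passed to the daemon depends on how your cluster is deployed.
       </p>

 <codeblock>statestored -statestore_heartbeat_frequency_ms=2000
 </codeblock>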

       <p>
         If it takes a very long time for a cluster to start up, and <cmdname>impala-shell</cmdname> consistently
         displays <codeph>This Impala daemon is not ready to accept user requests</codeph>, the statestore might be
         taking too long to send the entire catalog topic to the cluster. In this case, consider adding
         <codeph>--load_catalog_in_background=false</codeph> to your catalog service configuration. This setting
         stops the catalog service from loading the metadata for all tables at cluster startup. Instead, metadata for
         each table is loaded when the table is accessed for the first time.
       </p>
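
       <p>
         A sketch of the corresponding startup line for the catalog service follows. Any other flags your
         deployment already passes to <cmdname>catalogd</cmdname> are omitted here.
       </p>

 <codeblock>catalogd --load_catalog_in_background=false
 </codeblock>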
     </conbody>
   </concept>

   <concept id="scalability_coordinator" rev="2.9.0 IMPALA-3807 IMPALA-5147 IMPALA-5503">

     <title>Controlling which Hosts are Coordinators and Executors</title>

     <conbody>

       <p>
         By default, each host in the cluster that runs the <cmdname>impalad</cmdname>
         daemon can act as the coordinator for an Impala query, execute the fragments
         of the execution plan for the query, or both. During highly concurrent
         workloads for large-scale queries, especially on large clusters, the dual
         roles can cause scalability issues:
       </p>

       <ul>
         <li>
           <p>
             The extra work required for a host to act as the coordinator could interfere
             with its capacity to perform other work for the earlier phases of the query.
             For example, the coordinator can experience significant network and CPU overhead
             during queries containing a large number of query fragments. Each coordinator
             caches metadata for all table partitions and data files, which can be substantial
             and contend with memory needed to process joins, aggregations, and other operations
             performed by query executors.
           </p>
         </li>
         <li>
           <p>
             Having a large number of hosts act as coordinators can cause unnecessary network
             overhead, or even timeout errors, as each of those hosts communicates with the
             <cmdname>statestored</cmdname> daemon for metadata updates.
           </p>
         </li>
         <li>
           <p>
             The <q>soft limits</q> imposed by the admission control feature are more likely
             to be exceeded when there are a large number of heavily loaded hosts acting as
             coordinators.
           </p>
         </li>
       </ul>

       <p>
         If such scalability bottlenecks occur, you can explicitly specify that certain
         hosts act as query coordinators, but not executors for query fragments.
         These hosts do not participate in I/O-intensive operations such as scans
         or CPU-intensive operations such as aggregations.
       </p>

       <p>
         Then, you specify that the
         other hosts act as executors but not coordinators. These hosts do not communicate
         with the <cmdname>statestored</cmdname> daemon or process the final result sets
         from queries. You cannot connect to these hosts through clients such as
         <cmdname>impala-shell</cmdname> or business intelligence tools.
       </p>

       <p>
         This feature is available in <keyword keyref="impala29_full"/> and higher.
       </p>

       <p>
         To use this feature, you specify one of the following startup flags for the
         <cmdname>impalad</cmdname> daemon on each host:
       </p>

       <ul>
         <li>
           <p>
             <codeph>is_executor=false</codeph> for each host that
             does not act as an executor for Impala queries.
             These hosts act exclusively as query coordinators.
             This setting typically applies to a relatively small number of
             hosts, because the most common topology is to have nearly all
             DataNodes doing work for query execution.
           </p>
         </li>
         <li>
           <p>
             <codeph>is_coordinator=false</codeph> for each host that
             does not act as a coordinator for Impala queries.
             These hosts act exclusively as executors.
             The number of hosts with this setting typically increases
             as the cluster grows larger and handles more table partitions,
             data files, and concurrent queries. As the overhead for query
             coordination increases, it becomes more important to centralize
             that work on dedicated hosts.
           </p>
         </li>
       </ul>

       <p>
         By default, both of these settings are enabled for each <codeph>impalad</codeph>
         instance, allowing all such hosts to act as both executors and coordinators.
       </p>

       <p>
         For example, on a 100-node cluster, you might specify <codeph>is_executor=false</codeph>
         for 10 hosts, to dedicate those hosts as query coordinators. Then specify
         <codeph>is_coordinator=false</codeph> for the remaining 90 hosts. All explicit or
         load-balanced connections must go to the 10 hosts acting as coordinators. These hosts
         perform the network communication to keep metadata up-to-date and route query results
         to the appropriate clients. The remaining 90 hosts perform the intensive I/O, CPU, and
         memory operations that make up the bulk of the work for each query. If a bottleneck or
         other performance issue arises on a specific host, you can narrow down the cause more
         easily because each host is dedicated to specific operations within the overall
         Impala workload.
       </p>
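
       <p>
         The 100-node layout described above could be sketched with startup lines like the following.
         Only the two role flags come from this topic; all other <cmdname>impalad</cmdname> flags that a
         real deployment needs are omitted, and the mechanism for setting startup flags depends on how your
         cluster is managed.
       </p>

 <codeblock># On each of the 10 hosts dedicated to coordinating queries:
 impalad --is_executor=false

 # On each of the remaining 90 hosts that only execute query fragments:
 impalad --is_coordinator=false
 </codeblock>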

     </conbody>
   </concept>

   <concept audience="hidden" id="scalability_cluster_size">

     <title>Scalability Considerations for Impala Cluster Size and Topology</title>

     <conbody>

       <p>
       </p>
     </conbody>
   </concept>

   <concept audience="hidden" id="concurrent_connections">

     <title>Scaling the Number of Concurrent Connections</title>

     <conbody>

       <p></p>
     </conbody>
   </concept>

   <concept rev="2.0.0" id="spill_to_disk">

     <title>SQL Operations that Spill to Disk</title>

     <conbody>

       <p>
         Certain memory-intensive operations write temporary data to disk (known as <term>spilling</term> to disk)
         when Impala is close to exceeding its memory limit on a particular host.
       </p>

       <p>
         The result is a query that completes successfully, rather than failing with an out-of-memory error. The
         tradeoff is decreased performance due to the extra disk I/O to write the temporary data and read it back
         in. The slowdown could potentially be significant. Thus, while this feature improves reliability,
         you should optimize your queries, system parameters, and hardware configuration to make this spilling a rare occurrence.
       </p>

       <p>
         <b>What kinds of queries might spill to disk:</b>
       </p>

       <p>
         Several SQL clauses and constructs require memory allocations that could activate the spilling mechanism:
       </p>
       <ul>
         <li>
           <p>
             When a query uses a <codeph>GROUP BY</codeph> clause for columns
             with millions or billions of distinct values, Impala keeps a
             similar number of temporary results in memory, to accumulate the
             aggregate results for each value in the group.
           </p>
         </li>
         <li>
           <p>
             When large tables are joined together, Impala keeps the values of
             the join columns from one table in memory, to compare them to
             incoming values from the other table.
           </p>
         </li>
         <li>
           <p>
             When a large result set is sorted by the <codeph>ORDER BY</codeph>
             clause, each node sorts its portion of the result set in memory.
           </p>
         </li>
         <li>
           <p>
             The <codeph>DISTINCT</codeph> and <codeph>UNION</codeph> operators
             build in-memory data structures to represent all values found so
             far, to eliminate duplicates as the query progresses.
           </p>
         </li>
         <!-- JIRA still in open state as of 5.8 / 2.6, commenting out.
         <li>
           <p rev="IMPALA-3471">
             In <keyword keyref="impala26_full"/> and higher, <term>top-N</term> queries (those with
             <codeph>ORDER BY</codeph> and <codeph>LIMIT</codeph> clauses) can also spill.
             Impala allocates enough memory to hold as many rows as specified by the <codeph>LIMIT</codeph>
             clause, plus enough memory to hold as many rows as specified by any <codeph>OFFSET</codeph> clause.
           </p>
         </li>
         -->
       </ul>

       <p conref="../shared/impala_common.xml#common/spill_to_disk_vs_dynamic_partition_pruning"/>

       <p>
         <b>How Impala handles scratch disk space for spilling:</b>
       </p>

       <p rev="obwl" conref="../shared/impala_common.xml#common/order_by_scratch_dir"/>

       <p>
         <b>Memory usage for SQL operators:</b>
       </p>

       <p>
         The infrastructure of the spilling feature affects the way the affected SQL operators, such as
         <codeph>GROUP BY</codeph>, <codeph>DISTINCT</codeph>, and joins, use memory.
         On each host that participates in the query, each such operator in a query accumulates memory
         while building the data structure to process the aggregation or join operation. The amount
         of memory used depends on the portion of the data being handled by that host, and thus might
         be different from one host to another. When the amount of memory being used for the operator
         on a particular host reaches a threshold amount, Impala reserves an additional memory buffer
         to use as a work area in case that operator causes the query to exceed the memory limit for
         that host. After allocating the memory buffer, the memory used by that operator remains
         essentially stable or grows only slowly, until the point where the memory limit is reached
         and the query begins writing temporary data to disk.
       </p>

       <p rev="2.2.0">
         Prior to Impala 2.2, the extra memory buffer for an operator that might spill to disk
         was allocated when the data structure used by the applicable SQL operator reached 16 MB in size,
         and the memory buffer itself was 512 MB. In Impala 2.2, these values are halved: the threshold value
         is 8 MB and the memory buffer is 256 MB. <ph rev="2.3.0">In <keyword keyref="impala23_full"/> and higher, the memory for the buffer
         is allocated in pieces, only as needed, to avoid sudden large jumps in memory usage.</ph> A query that uses
         multiple such operators might allocate multiple such memory buffers, as the size of the data structure
         for each operator crosses the threshold on a particular host.
       </p>

       <p>
         Therefore, a query that processes a relatively small amount of data on each host would likely
         never reach the threshold for any operator, and would never allocate any extra memory buffers. A query
         that did process millions of groups, distinct values, join keys, and so on might cross the threshold,
         causing its memory requirement to rise suddenly and then flatten out. The larger the cluster, the less data is processed
         on any particular host, thus reducing the chance of requiring the extra memory allocation.
       </p>

       <p>
         <b>Added in:</b> This feature was added to the <codeph>ORDER BY</codeph> clause in Impala 1.4.
         This feature was extended to cover join queries, aggregation functions, and analytic
         functions in Impala 2.0. The size of the memory work area required by
         each operator that spills was reduced from 512 megabytes to 256 megabytes in Impala 2.2.
       </p>

       <p>
         <b>Avoiding queries that spill to disk:</b>
       </p>

       <p>
         Because the extra I/O can impose significant performance overhead on these types of queries, try to avoid
         this situation by using the following steps:
       </p>

       <ol>
         <li>
           Detect how often queries spill to disk, and how much temporary data is written. Refer to the following
           sources:
           <ul>
             <li>
               The output of the <codeph>PROFILE</codeph> command in the <cmdname>impala-shell</cmdname>
               interpreter. This data shows the memory usage for each host and in total across the cluster. The
               <codeph>BlockMgr.BytesWritten</codeph> counter reports how much data was written to disk during the
               query.
             </li>

             <li>
               The <uicontrol>Queries</uicontrol> tab in the Impala debug web user interface. Select the query to
               examine and click the corresponding <uicontrol>Profile</uicontrol> link. This data breaks down the
               memory usage for a single host within the cluster, the host whose web interface you are connected to.
             </li>
           </ul>
         </li>

         <li>
           Use one or more techniques to reduce the possibility of the queries spilling to disk:
           <ul>
             <li>
               Increase the Impala memory limit if practical, for example, if you can increase the available memory
               by more than the amount of temporary data written to disk on a particular node. Remember that in
               Impala 2.0 and later, you can issue <codeph>SET MEM_LIMIT</codeph> as a SQL statement, which lets you
               fine-tune the memory usage for queries from JDBC and ODBC applications.
             </li>

             <li>
               Increase the number of nodes in the cluster, to increase the aggregate memory available to Impala and
               reduce the amount of memory required on each node.
             </li>

             <li>
               Increase the overall memory capacity of each DataNode at the hardware level.
             </li>

             <li>
               On a cluster with resources shared between Impala and other Hadoop components, use resource
               management features to allocate more memory for Impala. See
               <xref href="impala_resource_management.xml#resource_management"/> for details.
             </li>

             <li>
               If the memory pressure is due to running many concurrent queries rather than a few memory-intensive
               ones, consider using the Impala admission control feature to lower the limit on the number of
               concurrent queries. By spacing out the most resource-intensive queries, you can avoid spikes in
               memory usage and improve overall response times. See
               <xref href="impala_admission.xml#admission_control"/> for details.
             </li>

             <li>
               Tune the queries with the highest memory requirements, using one or more of the following techniques:
               <ul>
                 <li>
                   Run the <codeph>COMPUTE STATS</codeph> statement for all tables involved in large-scale joins and
                   aggregation queries.
                 </li>

                 <li>
                   Minimize your use of <codeph>STRING</codeph> columns in join columns. Prefer numeric values
                   instead.
                 </li>

                 <li>
                   Examine the <codeph>EXPLAIN</codeph> plan to understand the execution strategy being used for the
                   most resource-intensive queries. See <xref href="impala_explain_plan.xml#perf_explain"/> for
                   details.
                 </li>

                 <li>
                   If Impala still chooses a suboptimal execution strategy even with statistics available, or if it
                   is impractical to keep the statistics up to date for huge or rapidly changing tables, add hints
                   to the most resource-intensive queries to select the right execution strategy. See
                   <xref href="impala_hints.xml#hints"/> for details.
                 </li>
               </ul>
             </li>

             <li>
               If your queries experience substantial performance overhead due to spilling, enable the
               <codeph>DISABLE_UNSAFE_SPILLS</codeph> query option. This option prevents queries whose memory usage
               is likely to be exorbitant from spilling to disk. See
               <xref href="impala_disable_unsafe_spills.xml#disable_unsafe_spills"/> for details. As you tune
               problematic queries using the preceding steps, fewer and fewer will be cancelled by this option
               setting.
             </li>
           </ul>
         </li>
       </ol>
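
       <p>
         The tuning techniques in the preceding list might look like the following in an
         <cmdname>impala-shell</cmdname> session. This is a sketch only: the tables
         <codeph>sales_fact</codeph> and <codeph>customer_dim</codeph>, their columns, and the memory limit
         value are hypothetical, and the right join hint depends on the plan that
         <codeph>EXPLAIN</codeph> reports for your own query.
       </p>

 <codeblock>-- Hypothetical tables and columns, for illustration only.
 compute stats sales_fact;
 compute stats customer_dim;

 -- Raise the memory limit for queries in this session (illustrative value).
 set mem_limit=4g;

 -- Review the join order and join strategy before running the query.
 explain select c.region, count(*)
   from sales_fact s join customer_dim c on s.customer_id = c.customer_id
   group by c.region;

 -- If the plan is still suboptimal, a hint can select the join strategy.
 select c.region, count(*)
   from sales_fact s join [shuffle] customer_dim c on s.customer_id = c.customer_id
   group by c.region;

 -- impala-shell command: check memory usage and BlockMgr.BytesWritten
 -- for the query that just ran.
 profile;
 </codeblock>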

       <p>
         <b>Testing performance implications of spilling to disk:</b>
       </p>

       <p>
         To artificially provoke spilling, to test this feature and understand the performance implications, use a
         test environment with a memory limit of at least 2 GB. Issue the <codeph>SET</codeph> command with no
         arguments to check the current setting for the <codeph>MEM_LIMIT</codeph> query option. Set the query
         option <codeph>DISABLE_UNSAFE_SPILLS=true</codeph>. This option limits the spill-to-disk feature to prevent
         runaway disk usage from queries that are known in advance to be suboptimal. Within
         <cmdname>impala-shell</cmdname>, run a query that you expect to be memory-intensive, based on the criteria
         explained earlier. A self-join of a large table is a good candidate:
       </p>

 <codeblock>select count(*) from big_table a join big_table b using (column_with_many_values);
 </codeblock>

       <p>
         Issue the <codeph>PROFILE</codeph> command to get a detailed breakdown of the memory usage on each node
         during the query. The crucial part of the profile output concerning memory is the <codeph>BlockMgr</codeph>
         portion. For example, this profile shows that the query did not quite exceed the memory limit.
       </p>

 <codeblock>BlockMgr:
    - BlockWritesIssued: 1
    - BlockWritesOutstanding: 0
    - BlocksCreated: 24
    - BlocksRecycled: 1
    - BufferedPins: 0
    - MaxBlockSize: 8.00 MB (8388608)
    <b>- MemoryLimit: 200.00 MB (209715200)</b>
    <b>- PeakMemoryUsage: 192.22 MB (201555968)</b>
    - TotalBufferWaitTime: 0ns
    - TotalEncryptionTime: 0ns
    - TotalIntegrityCheckTime: 0ns
    - TotalReadBlockTime: 0ns
 </codeblock>

       <p>
         In this case, because the memory limit was already below any recommended value, the better way to
         provoke spilling is to increase the volume of data for the query rather than to reduce the memory
         limit any further.
       </p>

       <p>
         Set the <codeph>MEM_LIMIT</codeph> query option to a value that is smaller than the peak memory usage
         reported in the profile output. Do not specify a memory limit lower than about 300 MB, because with such a
         low limit, queries could fail to start for other reasons. Now try the memory-intensive query again.
       </p>
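
       <p>
         For instance, if the earlier profile had reported a peak memory usage of about 1.5 GB (a
         hypothetical figure used only for this sketch), the next attempt might look like this:
       </p>

 <codeblock>-- Assumes a previously reported peak of roughly 1.5 GB; choose a value
 -- below the peak from your own profile, but not below about 300 MB.
 set mem_limit=1g;
 select count(*) from big_table a join big_table b using (column_with_many_values);
 </codeblock>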

       <p>
         Check if the query fails with a message like the following:
       </p>

 <codeblock>WARNINGS: Spilling has been disabled for plans that do not have stats and are not hinted
 to prevent potentially bad plans from using too many cluster resources. Compute stats on
 these tables, hint the plan or disable this behavior via query options to enable spilling.
 </codeblock>

       <p>
         If so, Impala stopped the query rather than let it spill, because without statistics the query could
         have consumed substantial temporary disk space and slowed down so much that it would not complete in
         any reasonable time. Rather than rely on the spill-to-disk feature in this case, issue the
         <codeph>COMPUTE STATS</codeph> statement for the table or tables in your sample query. Then run the query
         again, check the peak memory usage again in the <codeph>PROFILE</codeph> output, and adjust the memory
         limit again if necessary to be lower than the peak memory usage.
       </p>
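
       <p>
         For the self-join used as the sample query, the sequence might look like the following sketch. The name
         <codeph>big_table</codeph> is the placeholder from the sample query. Substitute the tables that your
         own query references:
       </p>

 <codeblock>-- Gather table and column statistics so that the planner can choose an efficient plan.
 compute stats big_table;
 -- Rerun the query, then inspect the memory section of the profile again.
 select count(*) from big_table a join big_table b using (column_with_many_values);
 profile;
 </codeblock>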

       <p>
         At this point, you have a query that is memory-intensive, but Impala can optimize it efficiently so that
         the memory usage is not exorbitant. You have set an artificial constraint through the
         <codeph>MEM_LIMIT</codeph> option so that the query would normally fail with an out-of-memory error. But
         the automatic spill-to-disk feature means that the query should actually succeed, at the expense of some
         extra disk I/O to read and write temporary work data.
       </p>

       <p>
         Try the query again, and confirm that it succeeds. Examine the <codeph>PROFILE</codeph> output again. This
         time, look for lines of this form:
       </p>

 <codeblock>- SpilledPartitions: <varname>N</varname>
 </codeblock>

       <p>
         If you see any such lines with <varname>N</varname> greater than 0, that indicates the query would have
         failed in Impala releases prior to 2.0, but now it succeeded because of the spill-to-disk feature. Examine
         the total time taken by the <codeph>AGGREGATION_NODE</codeph> or other query fragments containing non-zero
         <codeph>SpilledPartitions</codeph> values. Compare the times to similar fragments that did not spill, for
         example in the <codeph>PROFILE</codeph> output when the same query is run with a higher memory limit. This
         gives you an idea of the performance penalty of the spill operation for a particular query with a
         particular memory limit. If you make the memory limit just a little lower than the peak memory usage, the
         query only needs to write a small amount of temporary data to disk. The lower you set the memory limit, the
         more temporary data is written and the slower the query becomes.
       </p>
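
       <p>
         One way to collect the figures for this comparison is to run the query twice in the same
         <cmdname>impala-shell</cmdname> session and capture the profile after each run. The following is a
         sketch only: it assumes that no other per-query limit (for example, one imposed by admission control) applies,
         and the <codeph>1g</codeph> value is a hypothetical limit chosen to be below the observed peak.
         Setting <codeph>MEM_LIMIT</codeph> to 0 removes the query-level limit.
       </p>

 <codeblock>-- Baseline run: no query-level memory limit, so the query should not need to spill.
 set mem_limit=0;
 select count(*) from big_table a join big_table b using (column_with_many_values);
 profile;

 -- Constrained run: hypothetical limit below the observed peak memory usage.
 set mem_limit=1g;
 select count(*) from big_table a join big_table b using (column_with_many_values);
 profile;
 -- Compare the timings of the fragments that report non-zero SpilledPartitions
 -- against the same fragments in the baseline profile.
 </codeblock>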

       <p>
         Now repeat this procedure for actual queries used in your environment. Use the
         <codeph>DISABLE_UNSAFE_SPILLS</codeph> setting to identify cases where queries used more memory than
         necessary due to lack of statistics on the relevant tables and columns, and issue <codeph>COMPUTE
         STATS</codeph> where necessary.
       </p>
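
       <p>
         As a sketch, the option can be enabled for the current session as follows. The table name is a
         placeholder for whichever table the failing query references:
       </p>

 <codeblock>-- Queries that lack statistics and are not hinted now fail instead of spilling
 -- when they exceed the memory limit.
 set disable_unsafe_spills=true;
 -- If a query then fails with the "Spilling has been disabled" warning,
 -- gather statistics for the tables involved and try the query again.
 compute stats big_table;
 </codeblock>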

       <p>
         <b>When to use DISABLE_UNSAFE_SPILLS:</b>
       </p>

       <p>
         You might wonder why you would not leave <codeph>DISABLE_UNSAFE_SPILLS</codeph> turned on all the time. Whether and
         how frequently to use this option depends on your system environment and workload.
       </p>

       <p>
         <codeph>DISABLE_UNSAFE_SPILLS</codeph> is suitable for an environment with ad hoc queries whose performance
         characteristics and memory usage are not known in advance. It prevents <q>worst-case scenario</q> queries
         that use large amounts of memory unnecessarily. Thus, you might turn this option on within a session while
         developing new SQL code, even though it is turned off for existing applications.
       </p>

       <p>
         Organizations where table and column statistics are generally up-to-date might leave this option turned on
         all the time, again to avoid worst-case scenarios for untested queries or if a problem in the ETL pipeline
         results in a table with no statistics. Turning on <codeph>DISABLE_UNSAFE_SPILLS</codeph> lets you <q>fail
         fast</q> in this case and immediately gather statistics or tune the problematic queries.
       </p>

       <p>
         Some organizations might leave this option turned off. For example, you might have tables large enough that
         the <codeph>COMPUTE STATS</codeph> statement takes substantial time to run, making it impractical to re-run after
         loading new data. If you have examined the <codeph>EXPLAIN</codeph> plans of your queries and know that
         they are operating efficiently, you might leave <codeph>DISABLE_UNSAFE_SPILLS</codeph> turned off. In that
         case, you know that any queries that spill will not go overboard with their memory consumption.
       </p>

     </conbody>
   </concept>

 <concept id="complex_query">
 <title>Limits on Query Size and Complexity</title>
 <conbody>
 <p>
 There are hardcoded limits on the maximum size and complexity of queries.
 Currently, the maximum number of expressions in a query is 2000.
 You might exceed the limits with large or deeply nested queries
 produced by business intelligence tools or other query generators.
 </p>
 <p>
 If you have the ability to customize such queries or the query generation
 logic that produces them, replace sequences of repetitive expressions
 with single operators such as <codeph>IN</codeph> or <codeph>BETWEEN</codeph>
 that can represent multiple values or ranges.
 For example, instead of a large number of <codeph>OR</codeph> clauses:
 </p>
 <codeblock>WHERE val = 1 OR val = 2 OR val = 6 OR val = 100 ...
 </codeblock>
 <p>
 use a single <codeph>IN</codeph> clause:
 </p>
 <codeblock>WHERE val IN (1,2,6,100,...)</codeblock>
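 <p>
 Similarly, when the repeated comparisons cover a contiguous range of values, a single
 <codeph>BETWEEN</codeph> operator can stand in for all of them. The following sketch assumes an
 integer column and an illustrative range of 1 through 100:
 </p>
 <codeblock>-- Instead of: WHERE val = 1 OR val = 2 OR val = 3 ... OR val = 100
 WHERE val BETWEEN 1 AND 100</codeblock>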
 </conbody>
 </concept>

 <concept id="scalability_io">
 <title>Scalability Considerations for Impala I/O</title>
 <conbody>
 <p>
 Impala parallelizes its I/O operations aggressively, so
 the more disks you can attach to each host, the better.
 Impala retrieves data from disk so quickly, using
 bulk read operations on large blocks, that most queries
 are CPU-bound rather than I/O-bound.
 </p>
 <p>
 Because the kind of sequential scanning typically done by
 Impala queries does not benefit much from the random-access
 capabilities of SSDs, spinning disks typically provide
 the most cost-effective kind of storage for Impala data,
 with little or no performance penalty as compared to SSDs.
 </p>
 <p>
 Resource management features such as YARN, Llama, and admission control
 typically constrain the amount of memory, CPU, or overall number of
 queries in a high-concurrency environment.
 Currently, there is no throttling mechanism for Impala I/O.
 </p>
 </conbody>
 </concept>

   <concept id="big_tables">
     <title>Scalability Considerations for Table Layout</title>
     <conbody>
       <p>
         Due to the overhead of retrieving and updating table metadata
         in the metastore database, try to limit the number of columns
         in a table to a maximum of approximately 2000.
         Although Impala can handle wider tables than this, the metastore overhead
         can become significant, leading to query performance that is slower
         than expected based on the actual data volume.
       </p>
       <p>
         To minimize overhead related to the metastore database and Impala query planning,
         try to limit the number of partitions for any partitioned table to a few tens of thousands.
       </p>
       <p rev="IMPALA-5309">
         If the volume of data within a table makes it impractical to run exploratory
         queries, consider using the <codeph>TABLESAMPLE</codeph> clause to limit query processing
         to only a percentage of data within the table. This technique reduces the overhead
         for query startup, I/O to read the data, and the amount of network, CPU, and memory
         needed to process intermediate results during the query. See <xref keyref="tablesample"/>
         for details.
       </p>
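       <p>
         As a minimal sketch, the following exploratory query processes approximately 10 percent of the data
         in a table. The table name, column name, and percentage are illustrative assumptions:
       </p>
 <codeblock>-- Read roughly 10% of the table's data; the sampled fraction is approximate.
 select count(distinct column_with_many_values)
   from big_table tablesample system(10);
 </codeblock>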
     </conbody>
   </concept>

 <concept rev="" id="kerberos_overhead_cluster_size">
 <title>Kerberos-Related Network Overhead for Large Clusters</title>
 <conbody>
 <p>
 When Impala starts up, or after each <codeph>kinit</codeph> refresh, Impala sends a number of
 simultaneous requests to the KDC. For a cluster with 100 hosts, the KDC might be able to process
 all the requests within roughly 5 seconds. For a cluster with 1000 hosts, the time to process
 the requests would be roughly 500 seconds: a tenfold increase in hosts leads to roughly a hundredfold
 increase in processing time. Impala also makes a number of DNS requests at the same
 time as these Kerberos-related requests.
 </p>
 <p>
 While these authentication requests are being processed, any submitted Impala queries will fail.
 During this period, the KDC and DNS may be slow to respond to requests from components other than Impala,
 so other secure services might be affected temporarily.
 </p>

 <p>
   To reduce the frequency of the <codeph>kinit</codeph> renewal that initiates
   a new set of authentication requests, increase the <codeph>kerberos_reinit_interval</codeph>
   configuration setting for the <cmdname>impalad</cmdname> daemons. Currently, the default is 60 minutes.
   Consider using a higher value such as 360 (6 hours).
 </p>
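
 <p>
   As a sketch, the setting is passed as a startup flag to each <cmdname>impalad</cmdname> daemon, with the
   value expressed in minutes. How startup flags are supplied depends on how your cluster is deployed and
   managed, so treat the placement of this flag as an assumption to verify for your environment:
 </p>

 <codeblock>--kerberos_reinit_interval=360</codeblock>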

 </conbody>
 </concept>

   <concept rev="IMPALA-2294" id="kerberos_overhead_memory_usage">
   <title>Kerberos-Related Memory Overhead for Large Clusters</title>
   <conbody>
     <p conref="../shared/impala_common.xml#common/vm_overcommit_memory_intro"/>
     <p conref="../shared/impala_common.xml#common/vm_overcommit_memory_start" conrefend="vm_overcommit_memory_end"/>
   </conbody>
   </concept>

   <concept id="scalability_hotspots" rev="2.5.0 IMPALA-2696">
     <title>Avoiding CPU Hotspots for HDFS Cached Data</title>
     <conbody>
       <p>
         You can use the HDFS caching feature, described in <xref href="impala_perf_hdfs_caching.xml#hdfs_caching"/>,
         with Impala to reduce I/O and memory-to-memory copying for frequently accessed tables or partitions.
       </p>
       <p>
         In the early days of this feature, you might have found that enabling HDFS caching
         resulted in little or no performance improvement, because it could result in
         <q>hotspots</q>: instead of the I/O to read the table data being parallelized across
         the cluster, the I/O was reduced but the CPU load to process the data blocks
         might be concentrated on a single host.
       </p>
       <p>
         To avoid hotspots, include the <codeph>WITH REPLICATION</codeph> clause with the
         <codeph>CREATE TABLE</codeph> or <codeph>ALTER TABLE</codeph> statements for tables that use HDFS caching.
         This clause allows more than one host to cache the relevant data blocks, so the CPU load
         can be shared, reducing the load on any one host.
         See <xref href="impala_create_table.xml#create_table"/> and <xref href="impala_alter_table.xml#alter_table"/>
         for details.
       </p>
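       <p>
         The following statements sketch both forms. The table names, column definitions, cache pool name,
         and replication factor are all illustrative assumptions, and the cache pool must already exist in HDFS:
       </p>
 <codeblock>-- Create a cached table and allow up to 3 cached replicas of each data block.
 create table cached_lookup (id bigint, val string)
   cached in 'example_pool' with replication = 3;

 -- Apply the same caching setting to an existing table.
 alter table existing_table set cached in 'example_pool' with replication = 3;
 </codeblock>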
       <p>
         Hotspots with high CPU load for HDFS cached data could still arise in some cases, due to
         the way that Impala schedules the work of processing data blocks on different hosts.
         In <keyword keyref="impala25_full"/> and higher, scheduling improvements mean that the work for
         HDFS cached data is divided better among all the hosts that have cached replicas
         for a particular data block. When more than one host has a cached replica for a data block,
         Impala assigns the work of processing that block to whichever host has done the least work
         (in terms of number of bytes read) for the current query. If hotspots persist even with this
         load-based scheduling algorithm, you can enable the query option <codeph>SCHEDULE_RANDOM_REPLICA=TRUE</codeph>
         to further distribute the CPU load. This setting causes Impala to randomly pick a host to process a cached
         data block if the scheduling algorithm encounters a tie when deciding which host has done the
         least work.
       </p>
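       <p>
         For example, to enable the option for the current session before running queries against the
         cached tables:
       </p>
 <codeblock>-- Break ties between equally loaded hosts at random when scheduling cached blocks.
 set schedule_random_replica=true;
 </codeblock>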
     </conbody>
   </concept>

 </concept>