| <?xml version="1.0" encoding="UTF-8"?><!-- |
| Licensed to the Apache Software Foundation (ASF) under one |
| or more contributor license agreements. See the NOTICE file |
| distributed with this work for additional information |
| regarding copyright ownership. The ASF licenses this file |
| to you under the Apache License, Version 2.0 (the |
| "License"); you may not use this file except in compliance |
| with the License. You may obtain a copy of the License at |
| |
| http://www.apache.org/licenses/LICENSE-2.0 |
| |
| Unless required by applicable law or agreed to in writing, |
| software distributed under the License is distributed on an |
| "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY |
| KIND, either express or implied. See the License for the |
| specific language governing permissions and limitations |
| under the License. |
| --> |
| <!DOCTYPE concept PUBLIC "-//OASIS//DTD DITA Concept//EN" "concept.dtd"> |
| <concept id="scalability"> |
| |
| <title>Scalability Considerations for Impala</title> |
| <titlealts audience="PDF"><navtitle>Scalability Considerations</navtitle></titlealts> |
| <prolog> |
| <metadata> |
| <data name="Category" value="Performance"/> |
| <data name="Category" value="Impala"/> |
| <data name="Category" value="Planning"/> |
| <data name="Category" value="Querying"/> |
| <data name="Category" value="Developers"/> |
| <data name="Category" value="Memory"/> |
| <data name="Category" value="Scalability"/> |
| <!-- Using domain knowledge about Impala, sizing, etc. to decide what to mark as 'Proof of Concept'. --> |
| <data name="Category" value="Proof of Concept"/> |
| </metadata> |
| </prolog> |
| |
| <conbody> |
| |
| <p> |
| This section explains how the size of your cluster and the volume of data influence SQL performance and |
| schema design for Impala tables. Typically, adding more cluster capacity reduces problems due to memory |
| limits or disk throughput. On the other hand, larger clusters are more likely to experience other kinds of |
| scalability issues, such as a single slow node that causes performance problems for queries. |
| </p> |
| |
| <p outputclass="toc inpage"/> |
| |
| <p conref="../shared/impala_common.xml#common/cookbook_blurb"/> |
| |
| </conbody> |
| |
| <concept audience="hidden" id="scalability_memory"> |
| |
| <title>Overview and Guidelines for Impala Memory Usage</title> |
| <prolog> |
| <metadata> |
| <data name="Category" value="Memory"/> |
| <data name="Category" value="Concepts"/> |
| <data name="Category" value="Best Practices"/> |
| <data name="Category" value="Guidelines"/> |
| </metadata> |
| </prolog> |
| |
| <conbody> |
| |
| <!-- |
| Outline adapted from Alan Choi's "best practices" and/or "performance cookbook" papers. |
| --> |
| |
| <codeblock>Memory Usage – the Basics |
| * Memory is used by: |
| * Hash join – RHS tables after decompression, filtering and projection |
| * Group by – proportional to the #groups |
| * Parquet writer buffer – 1GB per partition |
| * IO buffer (shared across queries) |
| * Metadata cache (no more than 1GB typically) |
| * Memory held and reused by later query |
| * Impala releases memory from time to time starting in 1.4. |
| |
| Memory Usage – Estimating Memory Usage |
| * Use Explain Plan |
| * Requires statistics! Mem estimate without stats is meaningless. |
| * Reports per-host memory requirement for this cluster size. |
| * Re-run if you’ve re-sized the cluster! |
| [image of explain plan] |
| |
| Memory Usage – Estimating Memory Usage |
| * EXPLAIN’s memory estimate issues |
| * Can be way off – much higher or much lower. |
| * group by’s estimate can be particularly off – when there’s a large number of group by columns. |
| * Mem estimate = NDV of group by column 1 * NDV of group by column 2 * ... NDV of group by column n |
| * Ignore EXPLAIN’s estimate if it’s too high! |
| * Do your own estimate for group by: |
| * GROUP BY mem usage per node = (total number of groups * size of each row) / number of nodes |
| |
| Memory Usage – Finding Actual Memory Usage |
| * Search for “Per Node Peak Memory Usage” in the profile. |
| This is accurate. Use it for production capacity planning. |
| |
| Memory Usage – Actual Memory Usage |
| * For complex queries, how do I know which part of my query is using too much memory? |
| * Use the ExecSummary from the query profile! |
| - But is that "Peak Mem" number aggregate or per-node? |
| [image of executive summary] |
| |
| Memory Usage – Hitting Mem-limit |
| * Top causes (in order) of hitting mem-limit even when running a single query: |
| 1. Lack of statistics |
| 2. Lots of joins within a single query |
| 3. Big-table joining big-table |
| 4. Gigantic group by |
| |
| Memory Usage – Hitting Mem-limit |
| Lack of stats |
| * Wrong join order, wrong join strategy, wrong insert strategy |
| * Explain Plan tells you that! |
| [image of explain plan] |
| * Fix: Compute Stats table |
| |
| Memory Usage – Hitting Mem-limit |
| Lots of joins within a single query |
| * select ... from fact, dim1, dim2, dim3, ... dimN where ... |
| * Each dim tbl can fit in memory, but not all of them together |
| * As of Impala 1.4, Impala might choose the wrong plan – BROADCAST |
| FIX 1: use shuffle hint |
| select ... from fact join [shuffle] dim1 on ... join [shuffle] dim2 ... |
| FIX 2: pre-join the dim tables (if possible) |
| - How about an example to illustrate that technique? |
| * Fewer joins => better performance! |
| |
| Memory Usage: Hitting Mem-limit |
| Big-table joining big-table |
| * Big-table (after decompression, filtering, and projection) is a table that is bigger than total cluster memory size. |
| * Impala 2.0 will do this (via disk-based join). Consider using Hive for now. |
| * (Advanced) For a simple query, you can try this advanced workaround – per-partition join |
| * Requires the partition key be part of the join key |
| select ... from BigTbl_A a join BigTbl_B b on a.part_key = b.part_key where a.part_key in (1,2,3) |
| union all |
| select ... from BigTbl_A a join BigTbl_B b on a.part_key = b.part_key where a.part_key in (4,5,6) |
| |
| Memory Usage: Hitting Mem-limit |
| Gigantic group by |
| * The total number of distinct groups is huge, such as group by userid. |
| * Impala 2.0 will do this (via disk-based agg). Consider using Hive for now. |
| - Is this one of the cases where people were unhappy we recommended Hive? |
| * (Advanced) For a simple query, you can try this advanced workaround – per-partition agg |
| * Requires the partition key be part of the group by |
| select part_key, col1, col2, ... agg(..) from tbl where |
| part_key in (1,2,3) group by part_key, col1, col2, ... |
| union all |
| select part_key, col1, col2, ... agg(..) from tbl where |
| part_key in (4,5,6) group by part_key, col1, col2, ... |
| |
| Memory Usage: Additional Notes |
| * Use explain plan for estimate; use profile for accurate measure |
| * Data skew can cause uneven memory usage |
| * Review previous common issues on out-of-memory |
| * Note: Even with disk-based joins, you'll want to review these steps to speed up queries and use memory more efficiently |
| </codeblock> |
| </conbody> |
| </concept> |
| |
| <concept id="scalability_catalog"> |
| |
| <title>Impact of Many Tables or Partitions on Impala Catalog Performance and Memory Usage</title> |
| |
| <conbody> |
| |
| <p audience="hidden"> |
| Details to fill in in future: Impact of <q>load catalog in background</q> option. |
| Changing timeouts. |
| </p> |
| |
| <p> |
| Because Hadoop I/O is optimized for reading and writing large files, Impala is optimized for tables |
| containing relatively few, large data files. Schemas containing thousands of tables, or tables containing |
| thousands of partitions, can encounter performance issues during startup or during DDL operations such as |
| <codeph>ALTER TABLE</codeph> statements. |
| </p> |
| |
| <note type="important" rev="TSB-168"> |
| <p> |
| Because of a change in the default heap size for the <cmdname>catalogd</cmdname> daemon in |
| <keyword keyref="impala25_full"/> and higher, the following procedure to increase the <cmdname>catalogd</cmdname> |
| memory limit might be required following an upgrade to <keyword keyref="impala25_full"/> even if not |
| needed previously. |
| </p> |
| </note> |
| |
| <p conref="../shared/impala_common.xml#common/increase_catalogd_heap_size"/> |
| |
| </conbody> |
| </concept> |
| |
| <concept rev="2.1.0" id="statestore_scalability"> |
| |
| <title>Scalability Considerations for the Impala Statestore</title> |
| |
| <conbody> |
| |
| <p> |
| Before <keyword keyref="impala21_full"/>, the statestore sent only one kind of message to its subscribers. This message contained all |
| updates for any topics that a subscriber had subscribed to. It also served to let subscribers know that the |
| statestore had not failed, and conversely the statestore used the success of sending a heartbeat to a |
| subscriber to decide whether or not the subscriber had failed. |
| </p> |
| |
| <p> |
| Combining topic updates and failure detection in a single message led to bottlenecks in clusters with large |
| numbers of tables, partitions, and HDFS data blocks. When the statestore was overloaded with metadata |
| updates to transmit, heartbeat messages were sent less frequently, sometimes causing subscribers to time |
| out their connection with the statestore. Increasing the subscriber timeout and decreasing the frequency of |
| statestore heartbeats worked around the problem, but reduced responsiveness when the statestore failed or |
| restarted. |
| </p> |
| |
| <p> |
| As of <keyword keyref="impala21_full"/>, the statestore now sends topic updates and heartbeats in separate messages. This allows the |
| statestore to send and receive a steady stream of lightweight heartbeats, and removes the requirement to |
| send topic updates according to a fixed schedule, reducing statestore network overhead. |
| </p> |
| |
| <p> |
| The statestore now has the following relevant configuration flags for the <cmdname>statestored</cmdname> |
| daemon: |
| </p> |
| |
| <dl> |
| <dlentry id="statestore_num_update_threads"> |
| |
| <dt> |
| <codeph>-statestore_num_update_threads</codeph> |
| </dt> |
| |
| <dd> |
| The number of threads inside the statestore dedicated to sending topic updates. You should not |
| typically need to change this value. |
| <p> |
| <b>Default:</b> 10 |
| </p> |
| </dd> |
| |
| </dlentry> |
| |
| <dlentry id="statestore_update_frequency_ms"> |
| |
| <dt> |
| <codeph>-statestore_update_frequency_ms</codeph> |
| </dt> |
| |
| <dd> |
| The frequency, in milliseconds, with which the statestore tries to send topic updates to each |
| subscriber. This is a best-effort value; if the statestore is unable to meet this frequency, it sends |
| topic updates as fast as it can. You should not typically need to change this value. |
| <p> |
| <b>Default:</b> 2000 |
| </p> |
| </dd> |
| |
| </dlentry> |
| |
| <dlentry id="statestore_num_heartbeat_threads"> |
| |
| <dt> |
| <codeph>-statestore_num_heartbeat_threads</codeph> |
| </dt> |
| |
| <dd> |
| The number of threads inside the statestore dedicated to sending heartbeats. You should not typically |
| need to change this value. |
| <p> |
| <b>Default:</b> 10 |
| </p> |
| </dd> |
| |
| </dlentry> |
| |
| <dlentry id="statestore_heartbeat_frequency_ms"> |
| |
| <dt> |
| <codeph>-statestore_heartbeat_frequency_ms</codeph> |
| </dt> |
| |
| <dd> |
| The frequency, in milliseconds, with which the statestore tries to send heartbeats to each subscriber. |
| This value should work well for large catalogs and clusters up to approximately 150 nodes. Beyond that, |
| you might need to increase this value to lengthen the interval between heartbeat messages. |
| <p> |
| <b>Default:</b> 1000 (one heartbeat message every second) |
| </p> |
| </dd> |
| |
| </dlentry> |
| </dl> |
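| <p> |
| For example, on a cluster well beyond 150 nodes, you might lengthen the heartbeat interval |
| by starting <cmdname>statestored</cmdname> with a flag such as the following (the value shown |
| is illustrative, not a recommendation): |
| </p> |
| |
| <codeblock>statestored -statestore_heartbeat_frequency_ms=2000</codeblock> |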
| |
| <p> |
| If it takes a very long time for a cluster to start up, and <cmdname>impala-shell</cmdname> consistently |
| displays <codeph>This Impala daemon is not ready to accept user requests</codeph>, the statestore might be |
| taking too long to send the entire catalog topic to the cluster. In this case, consider adding |
| <codeph>--load_catalog_in_background=false</codeph> to your catalog service configuration. This setting |
| stops the catalog service from loading the entire catalog into memory at cluster startup. Instead, metadata for |
| each table is loaded when the table is accessed for the first time. |
| </p> |
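| <p> |
| As a sketch, outside of any cluster management tool, the flag goes directly on the |
| <cmdname>catalogd</cmdname> command line: |
| </p> |
| |
| <codeblock>catalogd --load_catalog_in_background=false</codeblock> |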
| </conbody> |
| </concept> |
| |
| <concept audience="hidden" id="scalability_cluster_size"> |
| |
| <title>Scalability Considerations for Impala Cluster Size and Topology</title> |
| |
| <conbody> |
| |
| <p> |
| </p> |
| </conbody> |
| </concept> |
| |
| <concept audience="hidden" id="concurrent_connections"> |
| |
| <title>Scaling the Number of Concurrent Connections</title> |
| |
| <conbody> |
| |
| <p></p> |
| </conbody> |
| </concept> |
| |
| <concept rev="2.0.0" id="spill_to_disk"> |
| |
| <title>SQL Operations that Spill to Disk</title> |
| |
| <conbody> |
| |
| <p> |
| Certain memory-intensive operations write temporary data to disk (known as <term>spilling</term> to disk) |
| when Impala is close to exceeding its memory limit on a particular host. |
| </p> |
| |
| <p> |
| The result is a query that completes successfully, rather than failing with an out-of-memory error. The |
| tradeoff is decreased performance due to the extra disk I/O to write the temporary data and read it back |
| in. The slowdown could potentially be significant. Thus, while this feature improves reliability, |
| you should optimize your queries, system parameters, and hardware configuration to make spilling a rare occurrence. |
| </p> |
| |
| <p> |
| <b>What kinds of queries might spill to disk:</b> |
| </p> |
| |
| <p> |
| Several SQL clauses and constructs require memory allocations that could activate the spilling mechanism: |
| </p> |
| <ul> |
| <li> |
| <p> |
| When a query uses a <codeph>GROUP BY</codeph> clause for columns |
| with millions or billions of distinct values, Impala keeps a |
| similar number of temporary results in memory, to accumulate the |
| aggregate results for each value in the group. |
| </p> |
| </li> |
| <li> |
| <p> |
| When large tables are joined together, Impala keeps the values of |
| the join columns from one table in memory, to compare them to |
| incoming values from the other table. |
| </p> |
| </li> |
| <li> |
| <p> |
| When a large result set is sorted by the <codeph>ORDER BY</codeph> |
| clause, each node sorts its portion of the result set in memory. |
| </p> |
| </li> |
| <li> |
| <p> |
| The <codeph>DISTINCT</codeph> and <codeph>UNION</codeph> operators |
| build in-memory data structures to represent all values found so |
| far, to eliminate duplicates as the query progresses. |
| </p> |
| </li> |
| <!-- JIRA still in open state as of 5.8 / 2.6, commenting out. |
| <li> |
| <p rev="IMPALA-3471"> |
| In <keyword keyref="impala26_full"/> and higher, <term>top-N</term> queries (those with |
| <codeph>ORDER BY</codeph> and <codeph>LIMIT</codeph> clauses) can also spill. |
| Impala allocates enough memory to hold as many rows as specified by the <codeph>LIMIT</codeph> |
| clause, plus enough memory to hold as many rows as specified by any <codeph>OFFSET</codeph> clause. |
| </p> |
| </li> |
| --> |
| </ul> |
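| <p> |
| As a sketch, the following queries illustrate those constructs; the table and column names |
| are hypothetical: |
| </p> |
| |
| <codeblock>-- High-NDV GROUP BY: one group per distinct userid is kept in memory. |
| select userid, count(*) from events group by userid; |
| |
| -- Join of two large tables: join column values from one side are held in memory. |
| select count(*) from big_a a join big_b b on a.id = b.id; |
| |
| -- Large ORDER BY: each node sorts its portion of the result set in memory. |
| select * from events order by event_time desc; |
| |
| -- DISTINCT: all values seen so far are tracked in memory to eliminate duplicates. |
| select distinct userid from events; |
| </codeblock> |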
| |
| <p conref="../shared/impala_common.xml#common/spill_to_disk_vs_dynamic_partition_pruning"/> |
| |
| <p> |
| <b>How Impala handles scratch disk space for spilling:</b> |
| </p> |
| |
| <p rev="obwl" conref="../shared/impala_common.xml#common/order_by_scratch_dir"/> |
| |
| <p> |
| <b>Memory usage for SQL operators:</b> |
| </p> |
| |
| <p> |
| The infrastructure of the spilling feature affects the way the applicable SQL operators, such as |
| <codeph>GROUP BY</codeph>, <codeph>DISTINCT</codeph>, and joins, use memory. |
| On each host that participates in the query, each such operator in a query accumulates memory |
| while building the data structure to process the aggregation or join operation. The amount |
| of memory used depends on the portion of the data being handled by that host, and thus might |
| be different from one host to another. When the amount of memory being used for the operator |
| on a particular host reaches a threshold amount, Impala reserves an additional memory buffer |
| to use as a work area in case that operator causes the query to exceed the memory limit for |
| that host. After allocating the memory buffer, the memory used by that operator remains |
| essentially stable or grows only slowly, until the point where the memory limit is reached |
| and the query begins writing temporary data to disk. |
| </p> |
| |
| <p rev="2.2.0"> |
| Prior to Impala 2.2, the extra memory buffer for an operator that might spill to disk |
| was allocated when the data structure used by the applicable SQL operator reaches 16 MB in size, |
| and the memory buffer itself was 512 MB. In Impala 2.2, these values are halved: the threshold value |
| is 8 MB and the memory buffer is 256 MB. <ph rev="2.3.0">In <keyword keyref="impala23_full"/> and higher, the memory for the buffer |
| is allocated in pieces, only as needed, to avoid sudden large jumps in memory usage.</ph> A query that uses |
| multiple such operators might allocate multiple such memory buffers, as the size of the data structure |
| for each operator crosses the threshold on a particular host. |
| </p> |
| |
| <p> |
| Therefore, a query that processes a relatively small amount of data on each host would likely |
| never reach the threshold for any operator, and would never allocate any extra memory buffers. A query |
| that does process millions of groups, distinct values, join keys, and so on might cross the threshold, |
| causing its memory requirement to rise suddenly and then flatten out. The larger the cluster, the less data |
| is processed on any particular host, thus reducing the chance of requiring the extra memory allocation. |
| </p> |
| |
| <p> |
| <b>Added in:</b> This feature was added to the <codeph>ORDER BY</codeph> clause in Impala 1.4. |
| This feature was extended to cover join queries, aggregation functions, and analytic |
| functions in Impala 2.0. The size of the memory work area required by |
| each operator that spills was reduced from 512 megabytes to 256 megabytes in Impala 2.2. |
| </p> |
| |
| <p> |
| <b>Avoiding queries that spill to disk:</b> |
| </p> |
| |
| <p> |
| Because the extra I/O can impose significant performance overhead on these types of queries, try to avoid |
| this situation by using the following steps: |
| </p> |
| |
| <ol> |
| <li> |
| Detect how often queries spill to disk, and how much temporary data is written. Refer to the following |
| sources: |
| <ul> |
| <li> |
| The output of the <codeph>PROFILE</codeph> command in the <cmdname>impala-shell</cmdname> |
| interpreter. This data shows the memory usage for each host and in total across the cluster. The |
| <codeph>BlockMgr.BytesWritten</codeph> counter reports how much data was written to disk during the |
| query. |
| </li> |
| |
| <li> |
| The <uicontrol>Queries</uicontrol> tab in the Impala debug web user interface. Select the query to |
| examine and click the corresponding <uicontrol>Profile</uicontrol> link. This data breaks down the |
| memory usage for a single host within the cluster, the host whose web interface you are connected to. |
| </li> |
| </ul> |
| </li> |
| |
| <li> |
| Use one or more techniques to reduce the possibility of the queries spilling to disk: |
| <ul> |
| <li> |
| Increase the Impala memory limit if practical, for example, if you can increase the available memory |
| by more than the amount of temporary data written to disk on a particular node. Remember that in |
| Impala 2.0 and later, you can issue <codeph>SET MEM_LIMIT</codeph> as a SQL statement, which lets you |
| fine-tune the memory usage for queries from JDBC and ODBC applications. |
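| <p> |
| For example, a session-level sketch (the value shown is illustrative): |
| </p> |
| <codeblock>set mem_limit=2gb;</codeblock> |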
| </li> |
| |
| <li> |
| Increase the number of nodes in the cluster, to increase the aggregate memory available to Impala and |
| reduce the amount of memory required on each node. |
| </li> |
| |
| <li> |
| Increase the overall memory capacity of each DataNode at the hardware level. |
| </li> |
| |
| <li> |
| On a cluster with resources shared between Impala and other Hadoop components, use resource |
| management features to allocate more memory for Impala. See |
| <xref href="impala_resource_management.xml#resource_management"/> for details. |
| </li> |
| |
| <li> |
| If the memory pressure is due to running many concurrent queries rather than a few memory-intensive |
| ones, consider using the Impala admission control feature to lower the limit on the number of |
| concurrent queries. By spacing out the most resource-intensive queries, you can avoid spikes in |
| memory usage and improve overall response times. See |
| <xref href="impala_admission.xml#admission_control"/> for details. |
| </li> |
| |
| <li> |
| Tune the queries with the highest memory requirements, using one or more of the following techniques: |
| <ul> |
| <li> |
| Run the <codeph>COMPUTE STATS</codeph> statement for all tables involved in large-scale joins and |
| aggregation queries. |
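| <p> |
| For example (hypothetical table names): |
| </p> |
| <codeblock>compute stats big_fact; |
| compute stats big_dim;</codeblock> |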
| </li> |
| |
| <li> |
| Minimize your use of <codeph>STRING</codeph> columns in join columns. Prefer numeric values |
| instead. |
| </li> |
| |
| <li> |
| Examine the <codeph>EXPLAIN</codeph> plan to understand the execution strategy being used for the |
| most resource-intensive queries. See <xref href="impala_explain_plan.xml#perf_explain"/> for |
| details. |
| </li> |
| |
| <li> |
| If Impala still chooses a suboptimal execution strategy even with statistics available, or if it |
| is impractical to keep the statistics up to date for huge or rapidly changing tables, add hints |
| to the most resource-intensive queries to select the right execution strategy. See |
| <xref href="impala_hints.xml#hints"/> for details. |
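| <p> |
| For example, a sketch that forces a partitioned join with the <codeph>[SHUFFLE]</codeph> |
| hint (table and column names are hypothetical): |
| </p> |
| <codeblock>select count(*) from big_fact f join [shuffle] big_dim d on f.dim_id = d.id;</codeblock> |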
| </li> |
| </ul> |
| </li> |
| |
| <li> |
| If your queries experience substantial performance overhead due to spilling, enable the |
| <codeph>DISABLE_UNSAFE_SPILLS</codeph> query option. This option prevents queries whose memory usage |
| is likely to be exorbitant from spilling to disk. See |
| <xref href="impala_disable_unsafe_spills.xml#disable_unsafe_spills"/> for details. As you tune |
| problematic queries using the preceding steps, fewer and fewer will be cancelled by this option |
| setting. |
| </li> |
| </ul> |
| </li> |
| </ol> |
| |
| <p> |
| <b>Testing performance implications of spilling to disk:</b> |
| </p> |
| |
| <p> |
| To artificially provoke spilling, to test this feature and understand the performance implications, use a |
| test environment with a memory limit of at least 2 GB. Issue the <codeph>SET</codeph> command with no |
| arguments to check the current setting for the <codeph>MEM_LIMIT</codeph> query option. Set the query |
| option <codeph>DISABLE_UNSAFE_SPILLS=true</codeph>. This option limits the spill-to-disk feature to prevent |
| runaway disk usage from queries that are known in advance to be suboptimal. Within |
| <cmdname>impala-shell</cmdname>, run a query that you expect to be memory-intensive, based on the criteria |
| explained earlier. A self-join of a large table is a good candidate: |
| </p> |
| |
| <codeblock>select count(*) from big_table a join big_table b using (column_with_many_values); |
| </codeblock> |
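| <p> |
| For example, the full sequence within <cmdname>impala-shell</cmdname> might look like the |
| following sketch, reusing the query above: |
| </p> |
| |
| <codeblock>set;                              -- inspect the current MEM_LIMIT setting |
| set disable_unsafe_spills=true;   -- guard against runaway spilling from unoptimized queries |
| select count(*) from big_table a join big_table b using (column_with_many_values); |
| </codeblock> |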
| |
| <p> |
| Issue the <codeph>PROFILE</codeph> command to get a detailed breakdown of the memory usage on each node |
| during the query. The crucial part of the profile output concerning memory is the <codeph>BlockMgr</codeph> |
| portion. For example, this profile shows that the query did not quite exceed the memory limit. |
| </p> |
| |
| <codeblock>BlockMgr: |
| - BlockWritesIssued: 1 |
| - BlockWritesOutstanding: 0 |
| - BlocksCreated: 24 |
| - BlocksRecycled: 1 |
| - BufferedPins: 0 |
| - MaxBlockSize: 8.00 MB (8388608) |
| <b>- MemoryLimit: 200.00 MB (209715200)</b> |
| <b>- PeakMemoryUsage: 192.22 MB (201555968)</b> |
| - TotalBufferWaitTime: 0ns |
| - TotalEncryptionTime: 0ns |
| - TotalIntegrityCheckTime: 0ns |
| - TotalReadBlockTime: 0ns |
| </codeblock> |
| |
| <p> |
| In this case, because the memory limit was already below any recommended value, the next step is to |
| increase the volume of data for the query rather than reducing the memory limit any further. |
| </p> |
| |
| <p> |
| Set the <codeph>MEM_LIMIT</codeph> query option to a value that is smaller than the peak memory usage |
| reported in the profile output. Do not specify a memory limit lower than about 300 MB, because with such a |
| low limit, queries could fail to start for other reasons. Now try the memory-intensive query again. |
| </p> |
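| <p> |
| For example, if the profile reported a peak of roughly 450 MB, you might re-run the query |
| with a slightly lower limit (the figure is purely illustrative): |
| </p> |
| |
| <codeblock>set mem_limit=400mb; |
| select count(*) from big_table a join big_table b using (column_with_many_values); |
| </codeblock> |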
| |
| <p> |
| Check if the query fails with a message like the following: |
| </p> |
| |
| <codeblock>WARNINGS: Spilling has been disabled for plans that do not have stats and are not hinted |
| to prevent potentially bad plans from using too many cluster resources. Compute stats on |
| these tables, hint the plan or disable this behavior via query options to enable spilling. |
| </codeblock> |
| |
| <p> |
| If so, the query could have consumed substantial temporary disk space, slowing down so much that it would |
| not complete in any reasonable time. Rather than rely on the spill-to-disk feature in this case, issue the |
| <codeph>COMPUTE STATS</codeph> statement for the table or tables in your sample query. Then run the query |
| again, check the peak memory usage again in the <codeph>PROFILE</codeph> output, and adjust the memory |
| limit again if necessary to be lower than the peak memory usage. |
| </p> |
| |
| <p> |
| At this point, you have a query that is memory-intensive, but Impala can optimize it efficiently so that |
| the memory usage is not exorbitant. You have set an artificial constraint through the |
| <codeph>MEM_LIMIT</codeph> option so that the query would normally fail with an out-of-memory error. But |
| the automatic spill-to-disk feature means that the query should actually succeed, at the expense of some |
| extra disk I/O to read and write temporary work data. |
| </p> |
| |
| <p> |
| Try the query again, and confirm that it succeeds. Examine the <codeph>PROFILE</codeph> output again. This |
| time, look for lines of this form: |
| </p> |
| |
| <codeblock>- SpilledPartitions: <varname>N</varname> |
| </codeblock> |
| |
| <p> |
| If you see any such lines with <varname>N</varname> greater than 0, that indicates the query would have |
| failed in Impala releases prior to 2.0, but now it succeeded because of the spill-to-disk feature. Examine |
| the total time taken by the <codeph>AGGREGATION_NODE</codeph> or other query fragments containing non-zero |
| <codeph>SpilledPartitions</codeph> values. Compare the times to similar fragments that did not spill, for |
| example in the <codeph>PROFILE</codeph> output when the same query is run with a higher memory limit. This |
| gives you an idea of the performance penalty of the spill operation for a particular query with a |
| particular memory limit. If you make the memory limit just a little lower than the peak memory usage, the |
| query only needs to write a small amount of temporary data to disk. The lower you set the memory limit, the |
| more temporary data is written and the slower the query becomes. |
| </p> |
| |
| <p> |
| Now repeat this procedure for actual queries used in your environment. Use the |
| <codeph>DISABLE_UNSAFE_SPILLS</codeph> setting to identify cases where queries used more memory than |
| necessary due to lack of statistics on the relevant tables and columns, and issue <codeph>COMPUTE |
| STATS</codeph> where necessary. |
| </p> |
| |
| <p> |
| <b>When to use DISABLE_UNSAFE_SPILLS:</b> |
| </p> |
| |
| <p> |
| You might wonder why you would not leave <codeph>DISABLE_UNSAFE_SPILLS</codeph> turned on all the time. Whether and |
| how frequently to use this option depends on your system environment and workload. |
| </p> |
| |
| <p> |
| <codeph>DISABLE_UNSAFE_SPILLS</codeph> is suitable for an environment with ad hoc queries whose performance |
| characteristics and memory usage are not known in advance. It prevents <q>worst-case scenario</q> queries |
| that use large amounts of memory unnecessarily. Thus, you might turn this option on within a session while |
| developing new SQL code, even though it is turned off for existing applications. |
| </p> |
| |
| <p> |
| Organizations where table and column statistics are generally up-to-date might leave this option turned on |
| all the time, again to avoid worst-case scenarios for untested queries or if a problem in the ETL pipeline |
| results in a table with no statistics. Turning on <codeph>DISABLE_UNSAFE_SPILLS</codeph> lets you <q>fail |
| fast</q> in this case and immediately gather statistics or tune the problematic queries. |
| </p> |
| |
| <p> |
| Some organizations might leave this option turned off. For example, you might have tables large enough that |
| the <codeph>COMPUTE STATS</codeph> takes substantial time to run, making it impractical to re-run after |
| loading new data. If you have examined the <codeph>EXPLAIN</codeph> plans of your queries and know that |
| they are operating efficiently, you might leave <codeph>DISABLE_UNSAFE_SPILLS</codeph> turned off. In that |
| case, you know that any queries that spill will not go overboard with their memory consumption. |
| </p> |
| |
| </conbody> |
| </concept> |
| |
| <concept id="complex_query"> |
| <title>Limits on Query Size and Complexity</title> |
| <conbody> |
| <p> |
| There are hardcoded limits on the maximum size and complexity of queries. |
| Currently, the maximum number of expressions in a query is 2000. |
| You might exceed the limits with large or deeply nested queries |
| produced by business intelligence tools or other query generators. |
| </p> |
| <p> |
| If you have the ability to customize such queries or the query generation |
| logic that produces them, replace sequences of repetitive expressions |
| with single operators such as <codeph>IN</codeph> or <codeph>BETWEEN</codeph> |
| that can represent multiple values or ranges. |
| For example, instead of a large number of <codeph>OR</codeph> clauses: |
| </p> |
| <codeblock>WHERE val = 1 OR val = 2 OR val = 6 OR val = 100 ... |
| </codeblock> |
| <p> |
| use a single <codeph>IN</codeph> clause: |
| </p> |
| <codeblock>WHERE val IN (1,2,6,100,...)</codeblock> |
| </conbody> |
| </concept> |
| |
| <concept id="scalability_io"> |
| <title>Scalability Considerations for Impala I/O</title> |
| <conbody> |
| <p> |
| Impala parallelizes its I/O operations aggressively, |
| so the more disks you can attach to each host, the better. |
| Using bulk read operations on large blocks, Impala retrieves |
| data from disk so quickly that most queries |
| are CPU-bound rather than I/O-bound. |
| </p> |
| <p> |
| Because the kind of sequential scanning typically done by |
| Impala queries does not benefit much from the random-access |
| capabilities of SSDs, spinning disks typically provide |
| the most cost-effective kind of storage for Impala data, |
| with little or no performance penalty as compared to SSDs. |
| </p> |
| <p> |
| Resource management features such as YARN, Llama, and admission control |
| typically constrain the amount of memory, CPU, or overall number of |
| queries in a high-concurrency environment. |
| Currently, there is no throttling mechanism for Impala I/O. |
| </p> |
| </conbody> |
| </concept> |
| |
| <concept id="big_tables"> |
| <title>Scalability Considerations for Table Layout</title> |
| <conbody> |
| <p> |
| Due to the overhead of retrieving and updating table metadata |
| in the metastore database, try to limit the number of columns |
| in a table to a maximum of approximately 2000. |
| Although Impala can handle wider tables than this, the metastore overhead |
| can become significant, leading to query performance that is slower |
| than expected based on the actual data volume. |
| </p> |
| <p> |
| To minimize overhead related to the metastore database and Impala query planning, |
| try to limit the number of partitions for any partitioned table to a few tens of thousands. |
| </p> |
| </conbody> |
| </concept> |
| |
| <concept rev="" id="kerberos_overhead_cluster_size"> |
| <title>Kerberos-Related Network Overhead for Large Clusters</title> |
| <conbody> |
| <p> |
| When Impala starts up, or after each <codeph>kinit</codeph> refresh, Impala sends a number of |
| simultaneous requests to the KDC. For a cluster with 100 hosts, the KDC might be able to process |
| all the requests within roughly 5 seconds. For a cluster with 1000 hosts, the time to process |
| the requests would be roughly 500 seconds. Impala also makes a number of DNS requests at the same |
| time as these Kerberos-related requests. |
| </p> |
| <p> |
| While these authentication requests are being processed, any submitted Impala queries will fail. |
| During this period, the KDC and DNS may be slow to respond to requests from components other than Impala, |
| so other secure services might be affected temporarily. |
| </p> |
| |
| <p> |
| To reduce the frequency of the <codeph>kinit</codeph> renewal that initiates |
| a new set of authentication requests, increase the <codeph>kerberos_reinit_interval</codeph> |
| configuration setting for the <cmdname>impalad</cmdname> daemons. Currently, the default is 60 minutes. |
| Consider using a higher value such as 360 (6 hours). |
| </p> |
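| <p> |
| For example, to renew every 6 hours, you might add the following to the <cmdname>impalad</cmdname> |
| startup flags: |
| </p> |
| |
| <codeblock>-kerberos_reinit_interval=360</codeblock> |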
| |
| </conbody> |
| </concept> |
| |
| <concept id="scalability_hotspots" rev="2.5.0 IMPALA-2696"> |
| <title>Avoiding CPU Hotspots for HDFS Cached Data</title> |
| <conbody> |
| <p> |
| You can use the HDFS caching feature, described in <xref href="impala_perf_hdfs_caching.xml#hdfs_caching"/>, |
| with Impala to reduce I/O and memory-to-memory copying for frequently accessed tables or partitions. |
| </p> |
| <p> |
| In the early days of this feature, you might have found that enabling HDFS caching |
| resulted in little or no performance improvement, because it could result in |
| <q>hotspots</q>: instead of the I/O to read the table data being parallelized across |
| the cluster, the I/O was reduced but the CPU load to process the data blocks |
| might be concentrated on a single host. |
| </p> |
| <p> |
| To avoid hotspots, include the <codeph>WITH REPLICATION</codeph> clause with the |
| <codeph>CREATE TABLE</codeph> or <codeph>ALTER TABLE</codeph> statements for tables that use HDFS caching. |
| This clause allows more than one host to cache the relevant data blocks, so the CPU load |
| can be shared, reducing the load on any one host. |
| See <xref href="impala_create_table.xml#create_table"/> and <xref href="impala_alter_table.xml#alter_table"/> |
| for details. |
| </p> |
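| <p> |
| For example (the cache pool name and replication factor are illustrative): |
| </p> |
| |
| <codeblock>create table census_cached (name string, zip string) |
|   cached in 'pool1' with replication = 4; |
| |
| alter table census_existing set cached in 'pool1' with replication = 4;</codeblock> |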
| <p> |
| Hotspots with high CPU load for HDFS cached data could still arise in some cases, due to |
| the way that Impala schedules the work of processing data blocks on different hosts. |
| In <keyword keyref="impala25_full"/> and higher, scheduling improvements mean that the work for |
| HDFS cached data is divided better among all the hosts that have cached replicas |
| for a particular data block. When more than one host has a cached replica for a data block, |
| Impala assigns the work of processing that block to whichever host has done the least work |
| (in terms of number of bytes read) for the current query. If hotspots persist even with this |
| load-based scheduling algorithm, you can enable the query option <codeph>SCHEDULE_RANDOM_REPLICA=TRUE</codeph> |
| to further distribute the CPU load. This setting causes Impala to randomly pick a host to process a cached |
| data block if the scheduling algorithm encounters a tie when deciding which host has done the |
| least work. |
| </p> |
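| <p> |
| For example, within a session: |
| </p> |
| |
| <codeblock>set schedule_random_replica=true;</codeblock> |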
| </conbody> |
| </concept> |
| |
| </concept> |