| <?xml version="1.0" encoding="UTF-8"?> |
| <!-- |
| Licensed to the Apache Software Foundation (ASF) under one |
| or more contributor license agreements. See the NOTICE file |
| distributed with this work for additional information |
| regarding copyright ownership. The ASF licenses this file |
| to you under the Apache License, Version 2.0 (the |
| "License"); you may not use this file except in compliance |
| with the License. You may obtain a copy of the License at |
| |
| http://www.apache.org/licenses/LICENSE-2.0 |
| |
| Unless required by applicable law or agreed to in writing, |
| software distributed under the License is distributed on an |
| "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY |
| KIND, either express or implied. See the License for the |
| specific language governing permissions and limitations |
| under the License. |
| --> |
| <!DOCTYPE concept PUBLIC "-//OASIS//DTD DITA Concept//EN" "concept.dtd"> |
| <concept id="scalability"> |
| |
| <title>Scalability Considerations for Impala</title> |
| |
| <titlealts audience="PDF"> |
| |
| <navtitle>Scalability Considerations</navtitle> |
| |
| </titlealts> |
| |
| <prolog> |
| <metadata> |
| <data name="Category" value="Performance"/> |
| <data name="Category" value="Impala"/> |
| <data name="Category" value="Planning"/> |
| <data name="Category" value="Querying"/> |
| <data name="Category" value="Developers"/> |
| <data name="Category" value="Memory"/> |
| <data name="Category" value="Scalability"/> |
| <!-- Using domain knowledge about Impala, sizing, etc. to decide what to mark as 'Proof of Concept'. --> |
| <data name="Category" value="Proof of Concept"/> |
| </metadata> |
| </prolog> |
| |
| <conbody> |
| |
| <p> |
| This section explains how the size of your cluster and the volume of data influences SQL |
| performance and schema design for Impala tables. Typically, adding more cluster capacity |
| reduces problems due to memory limits or disk throughput. On the other hand, larger |
| clusters are more likely to have other kinds of scalability issues, such as a single slow |
| node that causes performance problems for queries. |
| </p> |
| |
| <p outputclass="toc inpage"/> |
| |
| <p conref="../shared/impala_common.xml#common/cookbook_blurb"/> |
| |
| </conbody> |
| |
| <concept audience="hidden" id="scalability_memory"> |
| |
| <title>Overview and Guidelines for Impala Memory Usage</title> |
| |
| <prolog> |
| <metadata> |
| <data name="Category" value="Memory"/> |
| <data name="Category" value="Concepts"/> |
| <data name="Category" value="Best Practices"/> |
| <data name="Category" value="Guidelines"/> |
| </metadata> |
| </prolog> |
| |
| <conbody> |
| |
| <!-- |
| Outline adapted from Alan Choi's "best practices" and/or "performance cookbook" papers. |
| --> |
| |
| <codeblock>Memory Usage – the Basics |
| * Memory is used by: |
| * Hash join – RHS tables after decompression, filtering and projection |
| * Group by – proportional to the #groups |
| * Parquet writer buffer – 1GB per partition |
| * IO buffer (shared across queries) |
| * Metadata cache (no more than 1GB typically) |
| * Memory held and reused by later query |
| * Impala releases memory from time to time starting in 1.4. |
| |
| Memory Usage – Estimating Memory Usage |
| * Use Explain Plan |
| * Requires statistics! Mem estimate without stats is meaningless. |
| * Reports per-host memory requirement for this cluster size. |
| * Re-run if you’ve re-sized the cluster! |
| [image of explain plan] |
| |
| Memory Usage – Estimating Memory Usage |
| * EXPLAIN’s memory estimate issues |
| * Can be way off – much higher or much lower. |
| * group by’s estimate can be particularly off – when there’s a large number of group by columns. |
| * Mem estimate = NDV of group by column 1 * NDV of group by column 2 * ... NDV of group by column n |
* Ignore EXPLAIN’s estimate if it’s too high!
* Do your own estimate for group by:
* GROUP BY mem usage = (total number of groups * size of each row) + (total number of groups * size of each row) / num nodes
| |
| Memory Usage – Finding Actual Memory Usage |
| * Search for “Per Node Peak Memory Usage” in the profile. |
| This is accurate. Use it for production capacity planning. |
| |
| Memory Usage – Actual Memory Usage |
| * For complex queries, how do I know which part of my query is using too much memory? |
| * Use the ExecSummary from the query profile! |
| - But is that "Peak Mem" number aggregate or per-node? |
| [image of executive summary] |
| |
| Memory Usage – Hitting Mem-limit |
| * Top causes (in order) of hitting mem-limit even when running a single query: |
| 1. Lack of statistics |
| 2. Lots of joins within a single query |
| 3. Big-table joining big-table |
| 4. Gigantic group by |
| |
| Memory Usage – Hitting Mem-limit |
| Lack of stats |
| * Wrong join order, wrong join strategy, wrong insert strategy |
| * Explain Plan tells you that! |
| [image of explain plan] |
| * Fix: Compute Stats table |
| |
| Memory Usage – Hitting Mem-limit |
| Lots of joins within a single query |
| * select...from fact, dim1, dim2,dim3,...dimN where ... |
| * Each dim tbl can fit in memory, but not all of them together |
| * As of Impala 1.4, Impala might choose the wrong plan – BROADCAST |
| FIX 1: use shuffle hint |
select ... from fact join [shuffle] dim1 on ... join [shuffle] dim2 on ...
| FIX 2: pre-join the dim tables (if possible) |
| - How about an example to illustrate that technique? |
* Fewer joins => better performance!
| |
| Memory Usage: Hitting Mem-limit |
| Big-table joining big-table |
| * Big-table (after decompression, filtering, and projection) is a table that is bigger than total cluster memory size. |
| * Impala 2.0 will do this (via disk-based join). Consider using Hive for now. |
| * (Advanced) For a simple query, you can try this advanced workaround – per-partition join |
| * Requires the partition key be part of the join key |
select ... from BigTbl_A a join BigTbl_B b on a.part_key = b.part_key where a.part_key in (1,2,3)
union all
select ... from BigTbl_A a join BigTbl_B b on a.part_key = b.part_key where a.part_key in (4,5,6)
| |
| Memory Usage: Hitting Mem-limit |
| Gigantic group by |
| * The total number of distinct groups is huge, such as group by userid. |
| * Impala 2.0 will do this (via disk-based agg). Consider using Hive for now. |
| - Is this one of the cases where people were unhappy we recommended Hive? |
| * (Advanced) For a simple query, you can try this advanced workaround – per-partition agg |
| * Requires the partition key be part of the group by |
select part_key, col1, col2, ... agg(..) from tbl where
part_key in (1,2,3)
group by part_key, col1, col2, ...
union all
select part_key, col1, col2, ... agg(..) from tbl where
part_key in (4,5,6)
group by part_key, col1, col2, ...
| |
| Memory Usage: Additional Notes |
| * Use explain plan for estimate; use profile for accurate measure |
* Data skew can cause uneven memory usage
| * Review previous common issues on out-of-memory |
| * Note: Even with disk-based joins, you'll want to review these steps to speed up queries and use memory more efficiently |
| </codeblock> |
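
    <p>
      The per-partition workaround sketched in the notes above is easier to follow with a
      concrete query. The following sketch uses hypothetical table and column names
      (<codeph>sales_fact</codeph>, <codeph>part_key</codeph>, <codeph>customer_id</codeph>,
      <codeph>amount</codeph>). Each branch aggregates only a subset of partitions, keeping
      the working set of the <codeph>GROUP BY</codeph> smaller, and the branches are
      recombined with <codeph>UNION ALL</codeph>:
    </p>

    <codeblock>-- Hypothetical schema: sales_fact is partitioned by part_key.
select part_key, customer_id, sum(amount) as total_amount
  from sales_fact
  where part_key in (1,2,3)
  group by part_key, customer_id
union all
select part_key, customer_id, sum(amount) as total_amount
  from sales_fact
  where part_key in (4,5,6)
  group by part_key, customer_id;
</codeblock>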
| |
| </conbody> |
| |
| </concept> |
| |
| <concept id="scalability_catalog"> |
| |
| <title>Impact of Many Tables or Partitions on Impala Catalog Performance and Memory Usage</title> |
| |
| <conbody> |
| |
| <p> |
| Because Hadoop I/O is optimized for reading and writing large files, Impala is optimized |
| for tables containing relatively few, large data files. Schemas containing thousands of |
| tables, or tables containing thousands of partitions, can encounter performance issues |
| during startup or during DDL operations such as <codeph>ALTER TABLE</codeph> statements. |
| </p> |
| |
| <note type="important" rev="TSB-168"> |
| <p> |
| Because of a change in the default heap size for the <cmdname>catalogd</cmdname> |
| daemon in <keyword |
| keyref="impala25_full"/> and higher, the following |
| procedure to increase the <cmdname>catalogd</cmdname> memory limit might be required |
| following an upgrade to <keyword keyref="impala25_full"/> even if not needed |
| previously. |
| </p> |
| </note> |
| |
| <p conref="../shared/impala_common.xml#common/increase_catalogd_heap_size" |
| /> |
| |
| </conbody> |
| |
| </concept> |
| |
| <concept rev="2.1.0" id="statestore_scalability"> |
| |
| <title>Scalability Considerations for the Impala Statestore</title> |
| |
| <conbody> |
| |
| <p> |
| Before <keyword keyref="impala21_full"/>, the statestore sent only one kind of message |
| to its subscribers. This message contained all updates for any topics that a subscriber |
| had subscribed to. It also served to let subscribers know that the statestore had not |
| failed, and conversely the statestore used the success of sending a heartbeat to a |
| subscriber to decide whether or not the subscriber had failed. |
| </p> |
| |
| <p> |
| Combining topic updates and failure detection in a single message led to bottlenecks in |
| clusters with large numbers of tables, partitions, and HDFS data blocks. When the |
| statestore was overloaded with metadata updates to transmit, heartbeat messages were |
| sent less frequently, sometimes causing subscribers to time out their connection with |
| the statestore. Increasing the subscriber timeout and decreasing the frequency of |
| statestore heartbeats worked around the problem, but reduced responsiveness when the |
| statestore failed or restarted. |
| </p> |
| |
| <p> |
| As of <keyword keyref="impala21_full"/>, the statestore now sends topic updates and |
| heartbeats in separate messages. This allows the statestore to send and receive a steady |
| stream of lightweight heartbeats, and removes the requirement to send topic updates |
| according to a fixed schedule, reducing statestore network overhead. |
| </p> |
| |
| <p> |
| The statestore now has the following relevant configuration flags for the |
| <cmdname>statestored</cmdname> daemon: |
| </p> |
| |
| <dl> |
| <dlentry id="statestore_num_update_threads"> |
| |
| <dt> |
| <codeph>-statestore_num_update_threads</codeph> |
| </dt> |
| |
| <dd> |
| The number of threads inside the statestore dedicated to sending topic updates. You |
| should not typically need to change this value. |
| <p> |
| <b>Default:</b> 10 |
| </p> |
| </dd> |
| |
| </dlentry> |
| |
| <dlentry id="statestore_update_frequency_ms"> |
| |
| <dt> |
| <codeph>-statestore_update_frequency_ms</codeph> |
| </dt> |
| |
| <dd> |
| The frequency, in milliseconds, with which the statestore tries to send topic |
| updates to each subscriber. This is a best-effort value; if the statestore is unable |
| to meet this frequency, it sends topic updates as fast as it can. You should not |
| typically need to change this value. |
| <p> |
| <b>Default:</b> 2000 |
| </p> |
| </dd> |
| |
| </dlentry> |
| |
| <dlentry id="statestore_num_heartbeat_threads"> |
| |
| <dt> |
| <codeph>-statestore_num_heartbeat_threads</codeph> |
| </dt> |
| |
| <dd> |
| The number of threads inside the statestore dedicated to sending heartbeats. You |
| should not typically need to change this value. |
| <p> |
| <b>Default:</b> 10 |
| </p> |
| </dd> |
| |
| </dlentry> |
| |
| <dlentry id="statestore_heartbeat_frequency_ms"> |
| |
| <dt> |
| <codeph>-statestore_heartbeat_frequency_ms</codeph> |
| </dt> |
| |
| <dd> |
| The frequency, in milliseconds, with which the statestore tries to send heartbeats |
| to each subscriber. This value should be good for large catalogs and clusters up to |
          approximately 150 nodes. Beyond that, you might need to increase this value to
          lengthen the interval between heartbeat messages.
| <p> |
| <b>Default:</b> 1000 (one heartbeat message every second) |
| </p> |
| </dd> |
| |
| </dlentry> |
| |
| <dlentry id="statestore_heartbeat_tcp_timeout_seconds"> |
| |
| <dt> |
| <codeph>-statestore_heartbeat_tcp_timeout_seconds</codeph> |
| </dt> |
| |
| <dd> |
          The time, in seconds, after which a heartbeat RPC to a subscriber times out. This
          setting protects against badly hung machines that are not able to respond to the
          heartbeat RPC in short order. Increase this value if the statestore log shows
          intermittent heartbeat RPC timeouts. You can consult the maximum value of the
          <codeph>statestore.priority-topic-update-durations</codeph> metric on the
          statestore to choose a reasonable value. Note that priority topic updates are
          assumed to be small amounts of data that take a small amount of time to process
          (similar in complexity to a heartbeat).
| <p> |
| <b>Default:</b> 3 |
| </p> |
| </dd> |
| |
| </dlentry> |
| |
| <dlentry id="statestore_max_missed_heartbeats"> |
| |
| <dt> |
| <codeph>-statestore_max_missed_heartbeats</codeph> |
| </dt> |
| |
| <dd> |
| Maximum number of consecutive heartbeat messages an impalad can miss before being |
| declared failed by the statestore. You should not typically need to change this |
| value. |
| <p> |
| <b>Default:</b> 10 |
| </p> |
| </dd> |
| |
| </dlentry> |
| |
| <dlentry id="statestore_subscriber_timeout_secs"> |
| |
| <dt> |
| <codeph>-statestore_subscriber_timeout_secs</codeph> |
| </dt> |
| |
| <dd> |
          The amount of time (in seconds) that may elapse before subscribers
          (<cmdname>impalad</cmdname> or <cmdname>catalogd</cmdname>) consider their
          connection with the statestore lost. When the timeout is exceeded, the
          <cmdname>impalad</cmdname> reregisters with the statestore and may be absent from
          the next round of cluster membership updates, causing query failures such as
          <q>Cancelled due to unreachable impalad(s)</q>. The value of this flag should be
          comparable to
          <codeph>
            (statestore_heartbeat_frequency_ms / 1000 + statestore_heartbeat_tcp_timeout_seconds)
            * statestore_max_missed_heartbeats</codeph>,
          so that subscribers do not reregister themselves too early and the statestore has a
          chance to resend heartbeats. You can also consult the maximum value of the
          <codeph>statestore-subscriber.heartbeat-interval-time</codeph> metric on the
          <cmdname>impalad</cmdname> daemons to choose a reasonable value.
| <p> |
| <b>Default:</b> 30 |
| </p> |
| </dd> |
| |
| </dlentry> |
| </dl> |
| |
| <p> |
| If it takes a very long time for a cluster to start up, and |
| <cmdname>impala-shell</cmdname> consistently displays <codeph>This Impala daemon is not |
| ready to accept user requests</codeph>, the statestore might be taking too long to send |
| the entire catalog topic to the cluster. In this case, consider adding |
| <codeph>--load_catalog_in_background=false</codeph> to your catalog service |
      configuration. This setting prevents the catalog service from loading the entire catalog into
| memory at cluster startup. Instead, metadata for each table is loaded when the table is |
| accessed for the first time. |
| </p> |
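
    <p>
      As an illustration only, the statestore flags described above are supplied as startup
      options for the <cmdname>statestored</cmdname> daemon. The values shown here are the
      defaults listed earlier:
    </p>

    <codeblock>-statestore_update_frequency_ms=2000
-statestore_heartbeat_frequency_ms=1000
-statestore_heartbeat_tcp_timeout_seconds=3
</codeblock>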
| |
| </conbody> |
| |
| </concept> |
| |
| <concept id="scalability_buffer_pool" rev="2.10.0 IMPALA-3200"> |
| |
| <title>Effect of Buffer Pool on Memory Usage (<keyword keyref="impala210"/> and higher)</title> |
| |
| <conbody> |
| |
| <p> |
| The buffer pool feature, available in <keyword keyref="impala210"/> and higher, changes |
| the way Impala allocates memory during a query. Most of the memory needed is reserved at |
| the beginning of the query, avoiding cases where a query might run for a long time |
| before failing with an out-of-memory error. The actual memory estimates and memory |
| buffers are typically smaller than before, so that more queries can run concurrently or |
| process larger volumes of data than previously. |
| </p> |
| |
| <p> |
| The buffer pool feature includes some query options that you can fine-tune: |
| <xref keyref="buffer_pool_limit"/>, |
| <xref |
| keyref="default_spillable_buffer_size"/>, |
| <xref keyref="max_row_size" |
| />, and <xref keyref="min_spillable_buffer_size"/>. |
| </p> |
| |
| <p> |
| Most of the effects of the buffer pool are transparent to you as an Impala user. Memory |
| use during spilling is now steadier and more predictable, instead of increasing rapidly |
| as more data is spilled to disk. The main change from a user perspective is the need to |
| increase the <codeph>MAX_ROW_SIZE</codeph> query option setting when querying tables |
| with columns containing long strings, many columns, or other combinations of factors |
| that produce very large rows. If Impala encounters rows that are too large to process |
| with the default query option settings, the query fails with an error message suggesting |
| to increase the <codeph>MAX_ROW_SIZE</codeph> setting. |
| </p> |
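
    <p>
      For example, if a query against a table with very wide rows fails with such an error,
      you might increase the row size limit for the session before rerunning the query. The
      table name and value here are only illustrative; base the value on the error message
      you receive:
    </p>

    <codeblock>set max_row_size=1mb;
select * from wide_table where id = 12345;
</codeblock>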
| |
| </conbody> |
| |
| </concept> |
| |
| <concept audience="hidden" id="scalability_cluster_size"> |
| |
| <title>Scalability Considerations for Impala Cluster Size and Topology</title> |
| |
| <conbody> |
| |
| <p/> |
| |
| </conbody> |
| |
| </concept> |
| |
| <concept audience="hidden" id="concurrent_connections"> |
| |
| <title>Scaling the Number of Concurrent Connections</title> |
| |
| <conbody> |
| |
| <p/> |
| |
| </conbody> |
| |
| </concept> |
| |
| <concept rev="2.0.0" id="spill_to_disk"> |
| |
| <title>SQL Operations that Spill to Disk</title> |
| |
| <conbody> |
| |
| <p> |
| Certain memory-intensive operations write temporary data to disk (known as |
| <term>spilling</term> to disk) when Impala is close to exceeding its memory limit on a |
| particular host. |
| </p> |
| |
| <p> |
| The result is a query that completes successfully, rather than failing with an |
| out-of-memory error. The tradeoff is decreased performance due to the extra disk I/O to |
      write the temporary data and read it back in. The slowdown could potentially be
| significant. Thus, while this feature improves reliability, you should optimize your |
| queries, system parameters, and hardware configuration to make this spilling a rare |
| occurrence. |
| </p> |
| |
| <note rev="2.10.0 IMPALA-3200"> |
| <p> |
| In <keyword keyref="impala210"/> and higher, also see |
| <xref |
| keyref="scalability_buffer_pool"/> for changes to Impala memory |
| allocation that might change the details of which queries spill to disk, and how much |
| memory and disk space is involved in the spilling operation. |
| </p> |
| </note> |
| |
| <p> |
| <b>What kinds of queries might spill to disk:</b> |
| </p> |
| |
| <p> |
      Several SQL clauses and constructs require memory allocations that could activate the
| spilling mechanism: |
| </p> |
| |
| <ul> |
| <li> |
| <p> |
          When a query uses a <codeph>GROUP BY</codeph> clause for columns with millions or
| billions of distinct values, Impala keeps a similar number of temporary results in |
| memory, to accumulate the aggregate results for each value in the group. |
| </p> |
| </li> |
| |
| <li> |
| <p> |
| When large tables are joined together, Impala keeps the values of the join columns |
| from one table in memory, to compare them to incoming values from the other table. |
| </p> |
| </li> |
| |
| <li> |
| <p> |
| When a large result set is sorted by the <codeph>ORDER BY</codeph> clause, each node |
| sorts its portion of the result set in memory. |
| </p> |
| </li> |
| |
| <li> |
| <p> |
| The <codeph>DISTINCT</codeph> and <codeph>UNION</codeph> operators build in-memory |
| data structures to represent all values found so far, to eliminate duplicates as the |
| query progresses. |
| </p> |
| </li> |
| |
| <!-- JIRA still in open state as of 5.8 / 2.6, commenting out. |
| <li> |
| <p rev="IMPALA-3471"> |
| In <keyword keyref="impala26_full"/> and higher, <term>top-N</term> queries (those with |
| <codeph>ORDER BY</codeph> and <codeph>LIMIT</codeph> clauses) can also spill. |
| Impala allocates enough memory to hold as many rows as specified by the <codeph>LIMIT</codeph> |
| clause, plus enough memory to hold as many rows as specified by any <codeph>OFFSET</codeph> clause. |
| </p> |
| </li> |
| --> |
| </ul> |
| |
| <p |
| conref="../shared/impala_common.xml#common/spill_to_disk_vs_dynamic_partition_pruning"/> |
| |
| <p> |
| <b>How Impala handles scratch disk space for spilling:</b> |
| </p> |
| |
| <p rev="obwl" |
| conref="../shared/impala_common.xml#common/order_by_scratch_dir"/> |
| |
| <p> |
| <b>Memory usage for SQL operators:</b> |
| </p> |
| |
| <p rev="2.10.0 IMPALA-3200"> |
      In <keyword keyref="impala210_full"/> and higher, the way SQL operators such as
      <codeph>GROUP BY</codeph>, <codeph>DISTINCT</codeph>, and joins transition between
      using additional memory and activating the spill-to-disk feature has changed. The memory
| required to spill to disk is reserved up front, and you can examine it in the |
| <codeph>EXPLAIN</codeph> plan when the <codeph>EXPLAIN_LEVEL</codeph> query option is |
| set to 2 or higher. |
| </p> |
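
    <p>
      For example, to see the per-operator memory reservations in the plan output, you might
      raise the explain level for the session before running <codeph>EXPLAIN</codeph>. The
      table and column names here are placeholders:
    </p>

    <codeblock>set explain_level=2;
explain select c_state, count(*) from customers group by c_state;
</codeblock>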
| |
| <p> |
| The infrastructure of the spilling feature affects the way the affected SQL operators, |
| such as <codeph>GROUP BY</codeph>, <codeph>DISTINCT</codeph>, and joins, use memory. On |
| each host that participates in the query, each such operator in a query requires memory |
| to store rows of data and other data structures. Impala reserves a certain amount of |
      memory up front for each operator that supports spill-to-disk, sufficient to
| execute the operator. If an operator accumulates more data than can fit in the reserved |
| memory, it can either reserve more memory to continue processing data in memory or start |
| spilling data to temporary scratch files on disk. Thus, operators with spill-to-disk |
| support can adapt to different memory constraints by using however much memory is |
| available to speed up execution, yet tolerate low memory conditions by spilling data to |
| disk. |
| </p> |
| |
| <p> |
      The amount of data depends on the portion of the data being handled by that host, and thus
| the operator may end up consuming different amounts of memory on different hosts. |
| </p> |
| |
| <!-- |
| <p> |
| The infrastructure of the spilling feature affects the way the affected SQL operators, such as |
| <codeph>GROUP BY</codeph>, <codeph>DISTINCT</codeph>, and joins, use memory. |
| On each host that participates in the query, each such operator in a query accumulates memory |
| while building the data structure to process the aggregation or join operation. The amount |
| of memory used depends on the portion of the data being handled by that host, and thus might |
| be different from one host to another. When the amount of memory being used for the operator |
| on a particular host reaches a threshold amount, Impala reserves an additional memory buffer |
| to use as a work area in case that operator causes the query to exceed the memory limit for |
| that host. After allocating the memory buffer, the memory used by that operator remains |
| essentially stable or grows only slowly, until the point where the memory limit is reached |
| and the query begins writing temporary data to disk. |
| </p> |
| |
| <p rev="2.2.0"> |
| Prior to Impala 2.2, the extra memory buffer for an operator that might spill to disk |
| was allocated when the data structure used by the applicable SQL operator reaches 16 MB in size, |
| and the memory buffer itself was 512 MB. In Impala 2.2, these values are halved: the threshold value |
| is 8 MB and the memory buffer is 256 MB. <ph rev="2.3.0">In <keyword keyref="impala23_full"/> and higher, the memory for the buffer |
| is allocated in pieces, only as needed, to avoid sudden large jumps in memory usage.</ph> A query that uses |
| multiple such operators might allocate multiple such memory buffers, as the size of the data structure |
| for each operator crosses the threshold on a particular host. |
| </p> |
| |
| <p> |
| Therefore, a query that processes a relatively small amount of data on each host would likely |
| never reach the threshold for any operator, and would never allocate any extra memory buffers. A query |
| that did process millions of groups, distinct values, join keys, and so on might cross the threshold, |
| causing its memory requirement to rise suddenly and then flatten out. The larger the cluster, less data is processed |
| on any particular host, thus reducing the chance of requiring the extra memory allocation. |
| </p> |
| --> |
| |
| <p> |
| <b>Added in:</b> This feature was added to the <codeph>ORDER BY</codeph> clause in |
| Impala 1.4. This feature was extended to cover join queries, aggregation functions, and |
| analytic functions in Impala 2.0. The size of the memory work area required by each |
| operator that spills was reduced from 512 megabytes to 256 megabytes in Impala 2.2. |
| <ph |
| rev="2.10.0 IMPALA-3200">The spilling mechanism was reworked to take |
| advantage of the Impala buffer pool feature and be more predictable and stable in |
| <keyword keyref="impala210_full"/>.</ph> |
| </p> |
| |
| <p> |
| <b>Avoiding queries that spill to disk:</b> |
| </p> |
| |
| <p> |
| Because the extra I/O can impose significant performance overhead on these types of |
| queries, try to avoid this situation by using the following steps: |
| </p> |
| |
| <ol> |
| <li> |
| Detect how often queries spill to disk, and how much temporary data is written. Refer |
| to the following sources: |
| <ul> |
| <li> |
| The output of the <codeph>PROFILE</codeph> command in the |
| <cmdname>impala-shell</cmdname> interpreter. This data shows the memory usage for |
| each host and in total across the cluster. The <codeph>WriteIoBytes</codeph> |
| counter reports how much data was written to disk for each operator during the |
| query. (In <keyword |
| keyref="impala29_full"/>, the counter was |
| named <codeph>ScratchBytesWritten</codeph>; in |
| <keyword |
| keyref="impala28_full"/> and earlier, it was named |
| <codeph>BytesWritten</codeph>.) |
| </li> |
| |
| <li> |
| The <uicontrol>Queries</uicontrol> tab in the Impala debug web user interface. |
| Select the query to examine and click the corresponding |
| <uicontrol>Profile</uicontrol> link. This data breaks down the memory usage for a |
| single host within the cluster, the host whose web interface you are connected to. |
| </li> |
| </ul> |
| </li> |
| |
| <li> |
| Use one or more techniques to reduce the possibility of the queries spilling to disk: |
| <ul> |
| <li> |
| Increase the Impala memory limit if practical, for example, if you can increase |
| the available memory by more than the amount of temporary data written to disk on |
| a particular node. Remember that in Impala 2.0 and later, you can issue |
| <codeph>SET MEM_LIMIT</codeph> as a SQL statement, which lets you fine-tune the |
| memory usage for queries from JDBC and ODBC applications. |
| </li> |
| |
| <li> |
| Increase the number of nodes in the cluster, to increase the aggregate memory |
| available to Impala and reduce the amount of memory required on each node. |
| </li> |
| |
| <li> |
| Add more memory to the hosts running Impala daemons. |
| </li> |
| |
| <li> |
| On a cluster with resources shared between Impala and other Hadoop components, use |
| resource management features to allocate more memory for Impala. See |
| <xref |
| href="impala_resource_management.xml#resource_management"/> |
| for details. |
| </li> |
| |
| <li> |
| If the memory pressure is due to running many concurrent queries rather than a few |
| memory-intensive ones, consider using the Impala admission control feature to |
| lower the limit on the number of concurrent queries. By spacing out the most |
| resource-intensive queries, you can avoid spikes in memory usage and improve |
| overall response times. See |
| <xref |
| href="impala_admission.xml#admission_control"/> for details. |
| </li> |
| |
| <li> |
| Tune the queries with the highest memory requirements, using one or more of the |
| following techniques: |
| <ul> |
| <li> |
| Run the <codeph>COMPUTE STATS</codeph> statement for all tables involved in |
| large-scale joins and aggregation queries. |
| </li> |
| |
| <li> |
| Minimize your use of <codeph>STRING</codeph> columns in join columns. Prefer |
| numeric values instead. |
| </li> |
| |
| <li> |
| Examine the <codeph>EXPLAIN</codeph> plan to understand the execution strategy |
| being used for the most resource-intensive queries. See |
| <xref href="impala_explain_plan.xml#perf_explain" |
| /> for |
| details. |
| </li> |
| |
| <li> |
| If Impala still chooses a suboptimal execution strategy even with statistics |
| available, or if it is impractical to keep the statistics up to date for huge |
| or rapidly changing tables, add hints to the most resource-intensive queries |
| to select the right execution strategy. See |
| <xref |
| href="impala_hints.xml#hints"/> for details. |
| </li> |
| </ul> |
| </li> |
| |
| <li> |
| If your queries experience substantial performance overhead due to spilling, |
| enable the <codeph>DISABLE_UNSAFE_SPILLS</codeph> query option. This option |
| prevents queries whose memory usage is likely to be exorbitant from spilling to |
| disk. See |
| <xref |
| href="impala_disable_unsafe_spills.xml#disable_unsafe_spills"/> |
| for details. As you tune problematic queries using the preceding steps, fewer and |
| fewer will be cancelled by this option setting. |
| </li> |
| </ul> |
| </li> |
| </ol> |
| |
| <p> |
| <b>Testing performance implications of spilling to disk:</b> |
| </p> |
| |
| <p> |
| To artificially provoke spilling, to test this feature and understand the performance |
| implications, use a test environment with a memory limit of at least 2 GB. Issue the |
| <codeph>SET</codeph> command with no arguments to check the current setting for the |
| <codeph>MEM_LIMIT</codeph> query option. Set the query option |
| <codeph>DISABLE_UNSAFE_SPILLS=true</codeph>. This option limits the spill-to-disk |
| feature to prevent runaway disk usage from queries that are known in advance to be |
| suboptimal. Within <cmdname>impala-shell</cmdname>, run a query that you expect to be |
| memory-intensive, based on the criteria explained earlier. A self-join of a large table |
| is a good candidate: |
| </p> |
| |
| <codeblock>select count(*) from big_table a join big_table b using (column_with_many_values); |
| </codeblock> |
| |
| <p> |
| Issue the <codeph>PROFILE</codeph> command to get a detailed breakdown of the memory |
| usage on each node during the query. |
| <!-- |
| The crucial part of the profile output concerning memory is the <codeph>BlockMgr</codeph> |
| portion. For example, this profile shows that the query did not quite exceed the memory limit. |
| --> |
| </p> |
| |
| <!-- Commenting out because now stale due to changes from the buffer pool (IMPALA-3200). |
| To do: Revisit these details later if indicated by user feedback. |
| |
| <codeblock>BlockMgr: |
| - BlockWritesIssued: 1 |
| - BlockWritesOutstanding: 0 |
| - BlocksCreated: 24 |
| - BlocksRecycled: 1 |
| - BufferedPins: 0 |
| - MaxBlockSize: 8.00 MB (8388608) |
| <b>- MemoryLimit: 200.00 MB (209715200)</b> |
| <b>- PeakMemoryUsage: 192.22 MB (201555968)</b> |
| - TotalBufferWaitTime: 0ns |
| - TotalEncryptionTime: 0ns |
| - TotalIntegrityCheckTime: 0ns |
| - TotalReadBlockTime: 0ns |
| </codeblock> |
| |
| <p> |
| In this case, because the memory limit was already below any recommended value, I increased the volume of |
| data for the query rather than reducing the memory limit any further. |
| </p> |
| --> |
| |
| <p> |
| Set the <codeph>MEM_LIMIT</codeph> query option to a value that is smaller than the peak |
| memory usage reported in the profile output. Now try the memory-intensive query again. |
| </p> |
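
    <p>
      For example, if the profile reported a per-node peak of roughly 1.5 GB, you might set a
      lower limit and rerun the query. The limit shown here is only illustrative:
    </p>

    <codeblock>set mem_limit=1g;
select count(*) from big_table a join big_table b using (column_with_many_values);
</codeblock>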
| |
| <p> |
| Check if the query fails with a message like the following: |
| </p> |
| |
| <codeblock>WARNINGS: Spilling has been disabled for plans that do not have stats and are not hinted |
| to prevent potentially bad plans from using too many cluster resources. Compute stats on |
| these tables, hint the plan or disable this behavior via query options to enable spilling. |
| </codeblock> |
| |
| <p> |
| If so, the query could have consumed substantial temporary disk space, slowing down so |
| much that it would not complete in any reasonable time. Rather than rely on the |
| spill-to-disk feature in this case, issue the <codeph>COMPUTE STATS</codeph> statement |
| for the table or tables in your sample query. Then run the query again, check the peak |
| memory usage again in the <codeph>PROFILE</codeph> output, and adjust the memory limit |
| again if necessary to be lower than the peak memory usage. |
| </p> |
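
    <p>
      For the self-join example shown earlier, the statement would look like this:
    </p>

    <codeblock>compute stats big_table;
</codeblock>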
| |
| <p> |
| At this point, you have a query that is memory-intensive, but Impala can optimize it |
| efficiently so that the memory usage is not exorbitant. You have set an artificial |
| constraint through the <codeph>MEM_LIMIT</codeph> option so that the query would |
| normally fail with an out-of-memory error. But the automatic spill-to-disk feature means |
| that the query should actually succeed, at the expense of some extra disk I/O to read |
| and write temporary work data. |
| </p> |
| |
| <p> |
| Try the query again, and confirm that it succeeds. Examine the <codeph>PROFILE</codeph> |
| output again. This time, look for lines of this form: |
| </p> |
| |
| <codeblock>- SpilledPartitions: <varname>N</varname> |
| </codeblock> |
| |
| <p> |
| If you see any such lines with <varname>N</varname> greater than 0, that indicates the |
| query would have failed in Impala releases prior to 2.0, but now it succeeded because of |
| the spill-to-disk feature. Examine the total time taken by the |
| <codeph>AGGREGATION_NODE</codeph> or other query fragments containing non-zero |
| <codeph>SpilledPartitions</codeph> values. Compare the times to similar fragments that |
| did not spill, for example in the <codeph>PROFILE</codeph> output when the same query is |
| run with a higher memory limit. This gives you an idea of the performance penalty of the |
| spill operation for a particular query with a particular memory limit. If you make the |
| memory limit just a little lower than the peak memory usage, the query only needs to |
| write a small amount of temporary data to disk. The lower you set the memory limit, the |
| more temporary data is written and the slower the query becomes. |
| </p> |
| |
| <p> |
| Now repeat this procedure for actual queries used in your environment. Use the |
| <codeph>DISABLE_UNSAFE_SPILLS</codeph> setting to identify cases where queries used more |
| memory than necessary due to lack of statistics on the relevant tables and columns, and |
| issue <codeph>COMPUTE STATS</codeph> where necessary. |
| </p> |
| |
| <p> |
| <b>When to use DISABLE_UNSAFE_SPILLS:</b> |
| </p> |
| |
| <p> |
      You might wonder why you should not leave <codeph>DISABLE_UNSAFE_SPILLS</codeph> turned on all the
| time. Whether and how frequently to use this option depends on your system environment |
| and workload. |
| </p> |
| |
| <p> |
| <codeph>DISABLE_UNSAFE_SPILLS</codeph> is suitable for an environment with ad hoc |
| queries whose performance characteristics and memory usage are not known in advance. It |
| prevents <q>worst-case scenario</q> queries that use large amounts of memory |
| unnecessarily. Thus, you might turn this option on within a session while developing new |
| SQL code, even though it is turned off for existing applications. |
| </p> |
| |
| <p> |
| Organizations where table and column statistics are generally up-to-date might leave |
| this option turned on all the time, again to avoid worst-case scenarios for untested |
| queries or if a problem in the ETL pipeline results in a table with no statistics. |
| Turning on <codeph>DISABLE_UNSAFE_SPILLS</codeph> lets you <q>fail fast</q> in this case |
| and immediately gather statistics or tune the problematic queries. |
| </p> |
| |
| <p> |
| Some organizations might leave this option turned off. For example, you might have |
| tables large enough that the <codeph>COMPUTE STATS</codeph> takes substantial time to |
| run, making it impractical to re-run after loading new data. If you have examined the |
| <codeph>EXPLAIN</codeph> plans of your queries and know that they are operating |
| efficiently, you might leave <codeph>DISABLE_UNSAFE_SPILLS</codeph> turned off. In that |
| case, you know that any queries that spill will not go overboard with their memory |
| consumption. |
| </p> |
| |
| </conbody> |
| |
| </concept> |
| |
| <concept id="complex_query"> |
| |
| <title>Limits on Query Size and Complexity</title> |
| |
| <conbody> |
| |
| <p> |
| There are hardcoded limits on the maximum size and complexity of queries. Currently, the |
| maximum number of expressions in a query is 2000. You might exceed the limits with large |
| or deeply nested queries produced by business intelligence tools or other query |
| generators. |
| </p> |
| |
| <p> |
| If you have the ability to customize such queries or the query generation logic that |
| produces them, replace sequences of repetitive expressions with single operators such as |
| <codeph>IN</codeph> or <codeph>BETWEEN</codeph> that can represent multiple values or |
| ranges. For example, instead of a large number of <codeph>OR</codeph> clauses: |
| </p> |
| |
| <codeblock>WHERE val = 1 OR val = 2 OR val = 6 OR val = 100 ... |
| </codeblock> |
| |
| <p> |
| use a single <codeph>IN</codeph> clause: |
| </p> |
| |
| <codeblock>WHERE val IN (1,2,6,100,...)</codeblock> |
| |
| </conbody> |
| |
| </concept> |
| |
| <concept id="scalability_io"> |
| |
| <title>Scalability Considerations for Impala I/O</title> |
| |
| <conbody> |
| |
| <p> |
      Impala parallelizes its I/O operations aggressively; therefore, the more disks you can
      attach to each host, the better. Impala retrieves data from disk so quickly, using bulk
      read operations on large blocks, that most queries are CPU-bound rather than I/O-bound.
| </p> |
| |
| <p> |
| Because the kind of sequential scanning typically done by Impala queries does not |
| benefit much from the random-access capabilities of SSDs, spinning disks typically |
| provide the most cost-effective kind of storage for Impala data, with little or no |
| performance penalty as compared to SSDs. |
| </p> |
| |
| <p> |
| Resource management features such as YARN, Llama, and admission control typically |
| constrain the amount of memory, CPU, or overall number of queries in a high-concurrency |
| environment. Currently, there is no throttling mechanism for Impala I/O. |
| </p> |
| |
| </conbody> |
| |
| </concept> |
| |
| <concept id="big_tables"> |
| |
| <title>Scalability Considerations for Table Layout</title> |
| |
| <conbody> |
| |
| <p> |
| Due to the overhead of retrieving and updating table metadata in the metastore database, |
| try to limit the number of columns in a table to a maximum of approximately 2000. |
| Although Impala can handle wider tables than this, the metastore overhead can become |
| significant, leading to query performance that is slower than expected based on the |
| actual data volume. |
| </p> |
| |
| <p> |
| To minimize overhead related to the metastore database and Impala query planning, try to |
| limit the number of partitions for any partitioned table to a few tens of thousands. |
| </p> |
| |
| <p rev="IMPALA-5309"> |
| If the volume of data within a table makes it impractical to run exploratory queries, |
| consider using the <codeph>TABLESAMPLE</codeph> clause to limit query processing to only |
| a percentage of data within the table. This technique reduces the overhead for query |
| startup, I/O to read the data, and the amount of network, CPU, and memory needed to |
| process intermediate results during the query. See <xref keyref="tablesample"/> for |
| details. |
| </p> |
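
    <p>
      For example, the following query, which uses a hypothetical table and column name,
      examines roughly ten percent of the data files in the table rather than scanning the
      whole table:
    </p>

    <codeblock>select count(distinct customer_id) from sales_data tablesample system(10);
</codeblock>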
| |
| </conbody> |
| |
| </concept> |
| |
| <concept rev="" id="kerberos_overhead_cluster_size"> |
| |
| <title>Kerberos-Related Network Overhead for Large Clusters</title> |
| |
| <conbody> |
| |
| <p> |
| When Impala starts up, or after each <codeph>kinit</codeph> refresh, Impala sends a |
| number of simultaneous requests to the KDC. For a cluster with 100 hosts, the KDC might |
| be able to process all the requests within roughly 5 seconds. For a cluster with 1000 |
| hosts, the time to process the requests would be roughly 500 seconds. Impala also makes |
| a number of DNS requests at the same time as these Kerberos-related requests. |
| </p> |
| |
| <p> |
| While these authentication requests are being processed, any submitted Impala queries |
| will fail. During this period, the KDC and DNS may be slow to respond to requests from |
| components other than Impala, so other secure services might be affected temporarily. |
| </p> |
| |
| <p> |
| In <keyword keyref="impala212_full"/> or earlier, to reduce the frequency of the |
| <codeph>kinit</codeph> renewal that initiates a new set of authentication requests, |
| increase the <codeph>kerberos_reinit_interval</codeph> configuration setting for the |
| <codeph>impalad</codeph> daemons. Currently, the default is 60 minutes. Consider using a |
| higher value such as 360 (6 hours). |
| </p> |
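
    <p>
      For example, in releases where the flag exists, you might specify it among the
      <cmdname>impalad</cmdname> startup options as follows:
    </p>

    <codeblock>-kerberos_reinit_interval=360
</codeblock>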
| |
| <p> |
| The <codeph>kerberos_reinit_interval</codeph> configuration setting is removed in |
| <keyword keyref="impala30_full"/>, and the above step is no longer needed. |
| </p> |
| |
| </conbody> |
| |
| </concept> |
| |
| <concept id="scalability_hotspots" rev="2.5.0 IMPALA-2696"> |
| |
| <title>Avoiding CPU Hotspots for HDFS Cached Data</title> |
| |
| <conbody> |
| |
| <p> |
| You can use the HDFS caching feature, described in |
| <xref |
| href="impala_perf_hdfs_caching.xml#hdfs_caching"/>, with Impala to |
| reduce I/O and memory-to-memory copying for frequently accessed tables or partitions. |
| </p> |
| |
| <p> |
| In the early days of this feature, you might have found that enabling HDFS caching |
| resulted in little or no performance improvement, because it could result in |
| <q>hotspots</q>: instead of the I/O to read the table data being parallelized across the |
| cluster, the I/O was reduced but the CPU load to process the data blocks might be |
| concentrated on a single host. |
| </p> |
| |
| <p> |
| To avoid hotspots, include the <codeph>WITH REPLICATION</codeph> clause with the |
| <codeph>CREATE TABLE</codeph> or <codeph>ALTER TABLE</codeph> statements for tables that |
| use HDFS caching. This clause allows more than one host to cache the relevant data |
| blocks, so the CPU load can be shared, reducing the load on any one host. See |
| <xref |
| href="impala_create_table.xml#create_table"/> and |
| <xref |
| href="impala_alter_table.xml#alter_table"/> for details. |
| </p> |
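
    <p>
      For example, the clause might look like the following when caching an existing table.
      The table name, cache pool name, and replication factor are hypothetical:
    </p>

    <codeblock>alter table census_data set cached in 'four_gig_pool' with replication = 4;
</codeblock>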
| |
| <p> |
| Hotspots with high CPU load for HDFS cached data could still arise in some cases, due to |
| the way that Impala schedules the work of processing data blocks on different hosts. In |
| <keyword keyref="impala25_full"/> and higher, scheduling improvements mean that the work |
| for HDFS cached data is divided better among all the hosts that have cached replicas for |
| a particular data block. When more than one host has a cached replica for a data block, |
| Impala assigns the work of processing that block to whichever host has done the least |
| work (in terms of number of bytes read) for the current query. If hotspots persist even |
| with this load-based scheduling algorithm, you can enable the query option |
| <codeph>SCHEDULE_RANDOM_REPLICA=TRUE</codeph> to further distribute the CPU load. This |
| setting causes Impala to randomly pick a host to process a cached data block if the |
| scheduling algorithm encounters a tie when deciding which host has done the least work. |
| </p> |
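
    <p>
      For example, you might enable the option for a session before rerunning the queries
      that show the hotspot behavior:
    </p>

    <codeblock>set schedule_random_replica=true;
</codeblock>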
| |
| </conbody> |
| |
| </concept> |
| |
| <concept id="scalability_file_handle_cache" rev="2.10.0 IMPALA-4623"> |
| |
| <title>Scalability Considerations for File Handle Caching</title> |
| |
| <conbody> |
| |
| <p> |
| One scalability aspect that affects heavily loaded clusters is the load on the metadata |
| layer from looking up the details as each file is opened. On HDFS, that can lead to |
| increased load on the NameNode, and on S3, this can lead to an excessive number of S3 |
| metadata requests. For example, a query that does a full table scan on a partitioned |
| table may need to read thousands of partitions, each partition containing multiple data |
| files. Accessing each column of a Parquet file also involves a separate <q>open</q> |
| call, further increasing the load on the NameNode. High NameNode overhead can add |
| startup time (that is, increase latency) to Impala queries, and reduce overall |
| throughput for non-Impala workloads that also require accessing HDFS files. |
| </p> |
| |
| <p> |
| You can reduce the number of calls made to your file system's metadata layer by enabling |
| the file handle caching feature. Data files that are accessed by different queries, or |
| even multiple times within the same query, can be accessed without a new <q>open</q> |
| call and without fetching the file details multiple times. |
| </p> |
| |
| <p> |
| Impala supports file handle caching for the following file systems: |
| <ul> |
| <li> |
| HDFS in <keyword keyref="impala210_full"/> and higher |
| <p> |
| In Impala 3.2 and higher, file handle caching also applies to remote HDFS file |
| handles. This is controlled by the <codeph>cache_remote_file_handles</codeph> flag |
| for an <codeph>impalad</codeph>. It is recommended that you use the default value |
            of <codeph>true</codeph> because this caching prevents your NameNode from being overloaded
| when your cluster has many remote HDFS reads. |
| </p> |
| </li> |
| |
| <li> |
| S3 in <keyword keyref="impala33_full"/> and higher |
| <p> |
| The <codeph>cache_s3_file_handles</codeph> <codeph>impalad</codeph> flag controls |
| the S3 file handle caching. The feature is enabled by default with the flag set to |
| <codeph>true</codeph>. |
| </p> |
| </li> |
| </ul> |
| </p> |
| |
| <p> |
| The feature is enabled by default with 20,000 file handles to be cached. To change the |
| value, set the configuration option <codeph>max_cached_file_handles</codeph> to a |
| non-zero value for each <cmdname>impalad</cmdname> daemon. From the initial default |
| value of 20000, adjust upward if NameNode request load is still significant, or downward |
| if it is more important to reduce the extra memory usage on each host. Each cache entry |
| consumes 6 KB, meaning that caching 20,000 file handles requires up to 120 MB on each |
| Impala executor. The exact memory usage varies depending on how many file handles have |
| actually been cached; memory is freed as file handles are evicted from the cache. |
| </p> |
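
    <p>
      For example, the cache size might be raised by adding a flag like the following to the
      <cmdname>impalad</cmdname> startup options. The value is only illustrative:
    </p>

    <codeblock>--max_cached_file_handles=40000
</codeblock>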
| |
| <p> |
| If a manual operation moves a file to the trashcan while the file handle is cached, |
| Impala still accesses the contents of that file. This is a change from prior behavior. |
| Previously, accessing a file that was in the trashcan would cause an error. This |
| behavior only applies to non-Impala methods of removing files, not the Impala mechanisms |
| such as <codeph>TRUNCATE TABLE</codeph> or <codeph>DROP TABLE</codeph>. |
| </p> |
| |
| <p> |
| If files are removed, replaced, or appended by operations outside of Impala, the way to |
| bring the file information up to date is to run the <codeph>REFRESH</codeph> statement |
| on the table. |
| </p> |
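
    <p>
      For example, after an external process appends files to a table's data directory, you
      might run the following statement, substituting your own database and table names:
    </p>

    <codeblock>refresh db_name.table_name;
</codeblock>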
| |
| <p> |
| File handle cache entries are evicted as the cache fills up, or based on a timeout |
| period when they have not been accessed for some time. |
| </p> |
| |
| <p> |
| To evaluate the effectiveness of file handle caching for a particular workload, issue |
| the <codeph>PROFILE</codeph> statement in <cmdname>impala-shell</cmdname> or examine |
| query profiles in the Impala Web UI. Look for the ratio of |
| <codeph>CachedFileHandlesHitCount</codeph> (ideally, should be high) to |
| <codeph>CachedFileHandlesMissCount</codeph> (ideally, should be low). Before starting |
| any evaluation, run several representative queries to <q>warm up</q> the cache because |
| the first time each data file is accessed is always recorded as a cache miss. |
| </p> |
| |
| <p> |
| To see metrics about file handle caching for each <cmdname>impalad</cmdname> instance, |
| examine the following fields on the <uicontrol>/metrics</uicontrol> page in the Impala |
| Web UI: |
| </p> |
| |
| <ul> |
| <li> |
| <uicontrol>impala-server.io.mgr.cached-file-handles-miss-count</uicontrol> |
| </li> |
| |
| <li> |
| <uicontrol>impala-server.io.mgr.num-cached-file-handles</uicontrol> |
| </li> |
| </ul> |
| |
| </conbody> |
| |
| </concept> |
| |
| </concept> |