| <?xml version="1.0" encoding="UTF-8"?> |
| <!-- |
| Licensed to the Apache Software Foundation (ASF) under one |
| or more contributor license agreements. See the NOTICE file |
| distributed with this work for additional information |
| regarding copyright ownership. The ASF licenses this file |
| to you under the Apache License, Version 2.0 (the |
| "License"); you may not use this file except in compliance |
| with the License. You may obtain a copy of the License at |
| |
| http://www.apache.org/licenses/LICENSE-2.0 |
| |
| Unless required by applicable law or agreed to in writing, |
| software distributed under the License is distributed on an |
| "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY |
| KIND, either express or implied. See the License for the |
| specific language governing permissions and limitations |
| under the License. |
| --> |
| <!DOCTYPE concept PUBLIC "-//OASIS//DTD DITA Concept//EN" "concept.dtd"> |
| <concept id="cluster_sizing"> |
| |
| <title>Cluster Sizing Guidelines for Impala</title> |
| <titlealts audience="PDF"><navtitle>Cluster Sizing</navtitle></titlealts> |
| <prolog> |
| <metadata> |
| <data name="Category" value="Impala"/> |
| <data name="Category" value="Clusters"/> |
| <data name="Category" value="Planning"/> |
| <data name="Category" value="Sizing"/> |
| <data name="Category" value="Deploying"/> |
| <!-- Hoist by my own petard. Memory is an important theme of this topic but that's in a <section> title. --> |
| <data name="Category" value="Sectionated Pages"/> |
| <data name="Category" value="Memory"/> |
| <data name="Category" value="Scalability"/> |
| <data name="Category" value="Proof of Concept"/> |
| <data name="Category" value="Requirements"/> |
| <data name="Category" value="Guidelines"/> |
| <data name="Category" value="Best Practices"/> |
| <data name="Category" value="Administrators"/> |
| </metadata> |
| </prolog> |
| |
| <conbody> |
| |
| <p> |
| <indexterm audience="hidden">cluster sizing</indexterm> |
This document provides very rough guidelines for estimating the size of the cluster needed for a specific
customer application. You can use this information when planning how much and what type of hardware to
acquire for a new cluster, or when adding Impala workloads to an existing cluster.
| </p> |
| |
| <note> |
| Before making purchase or deployment decisions, consult organizations with relevant experience |
| to verify the conclusions about hardware requirements based on your data volume and workload. |
| </note> |
| |
| <!-- <p outputclass="toc inpage"/> --> |
| |
| <p> |
Always use hosts with identical specifications and capacities for all the nodes in the cluster. Currently,
Impala divides the work evenly among cluster nodes, regardless of their exact hardware configuration.
Because work can be distributed in different ways for different queries, if some hosts are overloaded
compared to others in terms of CPU, memory, I/O, or network, you might experience inconsistent performance
and overall slowness.
| </p> |
| |
| <p> |
For analytic workloads with star/snowflake schemas, and using consistent hardware for all nodes (64 GB of
RAM, twelve 2 TB hard drives, 2 x E5-2630L CPUs with 12 cores total, and a 10 Gb network), the following
table estimates the number of DataNodes needed in the cluster based on data size and the number of
concurrent queries, for workloads similar to TPC-DS benchmark queries:
| </p> |
| |
| <table> |
<title>Cluster size estimation based on the number of concurrent queries and data size with a 20-second average query response time</title>
| <tgroup cols="6"> |
| <colspec colnum="1" colname="col1"/> |
| <colspec colnum="2" colname="col2"/> |
| <colspec colnum="3" colname="col3"/> |
| <colspec colnum="4" colname="col4"/> |
| <colspec colnum="5" colname="col5"/> |
| <colspec colnum="6" colname="col6"/> |
| <thead> |
| <row> |
| <entry> |
| Data Size |
| </entry> |
| <entry> |
| 1 query |
| </entry> |
| <entry> |
| 10 queries |
| </entry> |
| <entry> |
| 100 queries |
| </entry> |
| <entry> |
| 1000 queries |
| </entry> |
| <entry> |
| 2000 queries |
| </entry> |
| </row> |
| </thead> |
| <tbody> |
| <row> |
| <entry> |
| <b>250 GB</b> |
| </entry> |
| <entry> |
| 2 |
| </entry> |
| <entry> |
| 2 |
| </entry> |
| <entry> |
| 5 |
| </entry> |
| <entry> |
| 35 |
| </entry> |
| <entry> |
| 70 |
| </entry> |
| </row> |
| <row> |
| <entry> |
| <b>500 GB</b> |
| </entry> |
| <entry> |
| 2 |
| </entry> |
| <entry> |
| 2 |
| </entry> |
| <entry> |
| 10 |
| </entry> |
| <entry> |
| 70 |
| </entry> |
| <entry> |
| 135 |
| </entry> |
| </row> |
| <row> |
| <entry> |
| <b>1 TB</b> |
| </entry> |
| <entry> |
| 2 |
| </entry> |
| <entry> |
| 2 |
| </entry> |
| <entry> |
| 15 |
| </entry> |
| <entry> |
| 135 |
| </entry> |
| <entry> |
| 270 |
| </entry> |
| </row> |
| <row> |
| <entry> |
| <b>15 TB</b> |
| </entry> |
| <entry> |
| 2 |
| </entry> |
| <entry> |
| 20 |
| </entry> |
| <entry> |
| 200 |
| </entry> |
| <entry> |
| N/A |
| </entry> |
| <entry> |
| N/A |
| </entry> |
| </row> |
| <row> |
| <entry> |
| <b>30 TB</b> |
| </entry> |
| <entry> |
| 4 |
| </entry> |
| <entry> |
| 40 |
| </entry> |
| <entry> |
| 400 |
| </entry> |
| <entry> |
| N/A |
| </entry> |
| <entry> |
| N/A |
| </entry> |
| </row> |
| <row> |
| <entry> |
| <b>60 TB</b> |
| </entry> |
| <entry> |
| 8 |
| </entry> |
| <entry> |
| 80 |
| </entry> |
| <entry> |
| 800 |
| </entry> |
| <entry> |
| N/A |
| </entry> |
| <entry> |
| N/A |
| </entry> |
| </row> |
| </tbody> |
| </tgroup> |
| </table> |
| |
| <section id="sizing_factors"> |
| |
| <title>Factors Affecting Scalability</title> |
| |
| <p> |
| A typical analytic workload (TPC-DS style queries) using recommended hardware is usually CPU-bound. Each |
| node can process roughly 1.6 GB/sec. Both CPU-bound and disk-bound workloads can scale almost linearly with |
| cluster size. However, for some workloads, the scalability might be bounded by the network, or even by |
| memory. |
| </p> |
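
      <p>
        For example, taking that per-node rate as a rough planning figure, a hypothetical 20-node cluster of
        such hosts could scan on the order of:
      </p>

<codeblock>Aggregate scan rate = 20 nodes * 1.6 GB/sec per node = 32 GB/sec
</codeblock>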
| |
| <p> |
If the workload is already network-bound (on a 10 Gb network), increasing the cluster size won’t reduce
the network load; in fact, a larger cluster could increase network traffic because some queries involve
<q>broadcast</q> operations to all DataNodes. Therefore, boosting the cluster size does not improve query
throughput in a network-constrained environment.
| </p> |
| |
| <p> |
Let’s look at a memory-bound workload. A workload is memory-bound if Impala cannot run any additional
concurrent queries because all of the memory allocated to Impala has already been consumed, but neither
CPU, disk, nor network is saturated yet. This can happen because currently Impala uses only a single core
per node to process join and aggregation queries. For a node with 128 GB of RAM, if a single join consumes
50 GB of memory, that node cannot run more than 2 such queries at the same time.
| </p> |
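
      <p>
        As a minimal back-of-the-envelope check using those same figures (128 GB of RAM per node, 50 GB of
        memory per join):
      </p>

<codeblock>Concurrent memory-bound queries per node = floor(128 GB / 50 GB per query)
                                         = 2
</codeblock>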
| |
| <p> |
In that case, at most 2 cores per node are used and the CPU is never saturated, so per-node throughput
stays below 1.6 GB/sec. Throughput can still scale almost linearly with cluster size even for a
memory-bound workload; to raise per-node throughput, consider increasing the memory per node.
| </p> |
| |
| <p> |
| As long as the workload is not network- or memory-bound, we can use the 1.6 GB/second per node as the |
| throughput estimate. |
| </p> |
| </section> |
| |
| <section id="sizing_details"> |
| |
| <title>A More Precise Approach</title> |
| |
| <p> |
A more precise sizing estimate would require not only queries per minute (QPM), but also an average data
size scanned per query (D). With the proper partitioning strategy, D is usually a fraction of the total
data size. The following equation can be used as a rough guide to estimate the number of nodes (N) needed;
the 100 GB divisor reflects the approximate amount of data a single node can process per minute
(1.6 GB/second for 60 seconds is roughly 100 GB):
| </p> |
| |
| <codeblock>Eq 1: N > QPM * D / 100 GB |
| </codeblock> |
| |
| <p> |
Here is an example. Suppose, on average, a query scans 50 GB of data and the average response time is
required to be 15 seconds or less when there are 100 concurrent queries. The QPM is 100 / 15 * 60 = 400.
We can estimate the number of nodes using the equation above.
| </p> |
| |
| <codeblock>N > QPM * D / 100GB |
| N > 400 * 50GB / 100GB |
| N > 200 |
| </codeblock> |
| |
| <p> |
| Because this figure is a rough estimate, the corresponding number of nodes could be between 100 and 500. |
| </p> |
| |
| <p> |
Depending on the complexity of the query, the processing rate might change. If the query has more
joins, aggregation functions, or CPU-intensive functions such as string processing or complex UDFs, the
processing rate will be lower than 1.6 GB/second per node. On the other hand, if the query only scans and
filters numeric columns, the processing rate can be higher.
| </p> |
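
      <p>
        As a purely illustrative example of that effect, suppose a complex query brought the processing rate
        down to roughly 0.8 GB/second per node, or about 50 GB per node per minute (an assumed figure, not a
        measurement). The earlier estimate of 200 nodes would then double:
      </p>

<codeblock>N > QPM * D / (per-node GB per minute)
N > 400 * 50GB / 50GB
N > 400
</codeblock>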
| </section> |
| |
| <section id="sizing_mem_estimate"> |
| |
| <title>Estimating Memory Requirements</title> |
| <!-- |
| <prolog> |
| <metadata> |
| <data name="Category" value="Memory"/> |
| </metadata> |
| </prolog> |
| --> |
| |
| <p> |
Impala can handle joins between multiple large tables. Make sure that statistics are collected for all the
joined tables, using the <codeph><xref href="impala_compute_stats.xml#compute_stats">COMPUTE
STATS</xref></codeph> statement; a brief example follows the query below. However, joining big tables does
consume more memory. Use the following approach to calculate the minimum memory requirement.
| </p> |
| |
| <p> |
| Suppose you are running the following join: |
| </p> |
| |
<codeblock>select a.*, b.col_1, b.col_2, ... b.col_n
| from a, b |
| where a.key = b.key |
| and b.col_1 in (1,2,4...) |
| and b.col_4 in (....); |
| </codeblock> |
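
    <p>
      For the two tables in this query, the statistics mentioned earlier would be collected with:
    </p>

<codeblock>compute stats a;
compute stats b;
</codeblock>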
| |
| <p> |
Also suppose that table <codeph>B</codeph> is smaller than table <codeph>A</codeph> (but still a large table).
| </p> |
| |
| <p> |
The memory requirement for the query is that the size of the right-hand table (<codeph>B</codeph>), after
decompression, filtering (<codeph>b.col_1 in ... and b.col_4 in ...</codeph>), and projection (keeping only
the columns that are used), must be less than the total memory of the entire cluster.
| </p> |
| |
| <codeblock>Cluster Total Memory Requirement = Size of the smaller table * |
| selectivity factor from the predicate * |
| projection factor * compression ratio |
| </codeblock> |
| |
| <p> |
In this case, assume that table <codeph>B</codeph> is 100 TB in Parquet format with 200 columns. The
predicate on <codeph>B</codeph> (<codeph>b.col_1 in ... and b.col_4 in ...</codeph>) selects only 10% of
the rows from <codeph>B</codeph>, and the projection keeps only 5 of the 200 columns.
Snappy compression typically gives about 3x compression, so we estimate a 3x compression factor.
| </p> |
| |
| <codeblock>Cluster Total Memory Requirement = Size of the smaller table * |
| selectivity factor from the predicate * |
| projection factor * compression ratio |
| = 100TB * 10% * 5/200 * 3 |
| = 0.75TB |
| = 750GB |
| </codeblock> |
| |
| <p> |
So, if you have a 10-node cluster with 128 GB of RAM per node and you give 80% of that memory to Impala,
then you have approximately 1 TB of usable memory for Impala, which is more than 750 GB. Therefore, your
cluster can handle join queries of this magnitude.
| </p> |
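
      <p>
        Spelled out, that final check is:
      </p>

<codeblock>Usable memory for Impala = 10 nodes * 128 GB per node * 80%
                         = 1024 GB (approximately 1 TB)

1 TB > 750 GB, so the cluster can accommodate this join.
</codeblock>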
| </section> |
| </conbody> |
| </concept> |