<?xml version="1.0" encoding="UTF-8"?>
<!--
Licensed to the Apache Software Foundation (ASF) under one
or more contributor license agreements. See the NOTICE file
distributed with this work for additional information
regarding copyright ownership. The ASF licenses this file
to you under the Apache License, Version 2.0 (the
"License"); you may not use this file except in compliance
with the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing,
software distributed under the License is distributed on an
"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
KIND, either express or implied. See the License for the
specific language governing permissions and limitations
under the License.
-->
<!DOCTYPE concept PUBLIC "-//OASIS//DTD DITA Concept//EN" "concept.dtd">
<concept id="cluster_sizing">
<title>Cluster Sizing Guidelines for Impala</title>
<titlealts audience="PDF"><navtitle>Cluster Sizing</navtitle></titlealts>
<prolog>
<metadata>
<data name="Category" value="Impala"/>
<data name="Category" value="Clusters"/>
<data name="Category" value="Planning"/>
<data name="Category" value="Sizing"/>
<data name="Category" value="Deploying"/>
<!-- Hoist by my own petard. Memory is an important theme of this topic but that's in a <section> title. -->
<data name="Category" value="Sectionated Pages"/>
<data name="Category" value="Memory"/>
<data name="Category" value="Scalability"/>
<data name="Category" value="Proof of Concept"/>
<data name="Category" value="Requirements"/>
<data name="Category" value="Guidelines"/>
<data name="Category" value="Best Practices"/>
<data name="Category" value="Administrators"/>
</metadata>
</prolog>
<conbody>
<p>
<indexterm audience="hidden">cluster sizing</indexterm>
This document provides a very rough guideline to estimate the size of a cluster needed for a specific
customer application. You can use this information when planning how much and what type of hardware to
acquire for a new cluster, or when adding Impala workloads to an existing cluster.
</p>
<note>
Before making purchase or deployment decisions, consult organizations with relevant experience
to verify the conclusions about hardware requirements based on your data volume and workload.
</note>
<!-- <p outputclass="toc inpage"/> -->
<p>
Always use hosts with identical specifications and capacities for all the nodes in the cluster. Currently,
Impala divides the work evenly between cluster nodes, regardless of their exact hardware configuration.
Because work can be distributed in different ways for different queries, if some hosts are overloaded
compared to others in terms of CPU, memory, I/O, or network, you might experience inconsistent performance
and overall slowness.
</p>
<p>
For analytic workloads with star/snowflake schemas, using consistent hardware for all nodes (64 GB RAM,
12 x 2 TB hard drives, 2 x E5-2630L CPUs for 12 cores total, 10 Gb network), the following table estimates the number of
DataNodes needed in the cluster, based on data size and the number of concurrent queries, for workloads
similar to TPC-DS benchmark queries:
</p>
<table>
<title>Cluster size estimation based on the number of concurrent queries and data size, with a 20-second average query response time</title>
<tgroup cols="6">
<colspec colnum="1" colname="col1"/>
<colspec colnum="2" colname="col2"/>
<colspec colnum="3" colname="col3"/>
<colspec colnum="4" colname="col4"/>
<colspec colnum="5" colname="col5"/>
<colspec colnum="6" colname="col6"/>
<thead>
<row>
<entry>
Data Size
</entry>
<entry>
1 query
</entry>
<entry>
10 queries
</entry>
<entry>
100 queries
</entry>
<entry>
1000 queries
</entry>
<entry>
2000 queries
</entry>
</row>
</thead>
<tbody>
<row>
<entry>
<b>250 GB</b>
</entry>
<entry>
2
</entry>
<entry>
2
</entry>
<entry>
5
</entry>
<entry>
35
</entry>
<entry>
70
</entry>
</row>
<row>
<entry>
<b>500 GB</b>
</entry>
<entry>
2
</entry>
<entry>
2
</entry>
<entry>
10
</entry>
<entry>
70
</entry>
<entry>
135
</entry>
</row>
<row>
<entry>
<b>1 TB</b>
</entry>
<entry>
2
</entry>
<entry>
2
</entry>
<entry>
15
</entry>
<entry>
135
</entry>
<entry>
270
</entry>
</row>
<row>
<entry>
<b>15 TB</b>
</entry>
<entry>
2
</entry>
<entry>
20
</entry>
<entry>
200
</entry>
<entry>
N/A
</entry>
<entry>
N/A
</entry>
</row>
<row>
<entry>
<b>30 TB</b>
</entry>
<entry>
4
</entry>
<entry>
40
</entry>
<entry>
400
</entry>
<entry>
N/A
</entry>
<entry>
N/A
</entry>
</row>
<row>
<entry>
<b>60 TB</b>
</entry>
<entry>
8
</entry>
<entry>
80
</entry>
<entry>
800
</entry>
<entry>
N/A
</entry>
<entry>
N/A
</entry>
</row>
</tbody>
</tgroup>
</table>
<section id="sizing_factors">
<title>Factors Affecting Scalability</title>
<p>
A typical analytic workload (TPC-DS style queries) using recommended hardware is usually CPU-bound. Each
node can process roughly 1.6 GB/sec. Both CPU-bound and disk-bound workloads can scale almost linearly with
cluster size. However, for some workloads, the scalability might be bounded by the network, or even by
memory.
</p>
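<p>
For example, assuming the 1.6 GB/sec per-node rate, a back-of-the-envelope scan-time estimate for
a single query looks like this (the 1 TB scan size and 20-node cluster are hypothetical values
chosen for illustration):
</p>
<codeblock>Scan time = data scanned / (number of nodes * per-node rate)
          = 1,000 GB / (20 nodes * 1.6 GB/sec)
          = about 31 seconds
</codeblock>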
<p>
If the workload is already network bound (on a 10 Gb network), increasing the cluster size will not reduce
the network load; in fact, a larger cluster can increase network traffic because some queries involve
<q>broadcast</q> operations to all DataNodes. Therefore, boosting the cluster size does not improve query
throughput in a network-constrained environment.
</p>
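<p>
To see why, consider a broadcast join: every node participating in the join receives a full copy of the
right-hand table, so total network traffic grows roughly linearly with cluster size. The model below is a
rough sketch; actual traffic depends on the exchange operators in the query plan, and the 5 GB table and
20-node cluster are hypothetical:
</p>
<codeblock>Broadcast traffic = size of broadcast table * number of receiving nodes
                  = 5 GB * 20 nodes
                  = 100 GB sent over the network for one query
</codeblock>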
<p>
Let’s look at a memory-bound workload. A workload is memory-bound if Impala cannot run any additional
concurrent queries because all of the memory available to Impala has already been allocated, but neither
CPU, disk, nor network is saturated yet. This can happen because currently Impala uses only a single core
per node to process join and aggregation queries. For a node with 128 GB of RAM, if the join in a query
takes 50 GB of memory on that node, the node cannot run more than 2 such queries at the same time.
</p>
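<p>
The arithmetic behind that limit is a simple minimal estimate; real memory accounting also includes
other operators and per-query overhead:
</p>
<codeblock>Concurrent queries per node = floor(node memory / per-query join memory)
                            = floor(128 GB / 50 GB)
                            = 2
</codeblock>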
<p>
Therefore, at most 2 cores are used on each node. Throughput can still scale almost linearly even for a
memory-bound workload; the CPU is simply not saturated, and per-node throughput will be lower than
1.6 GB/sec. In this situation, consider increasing the memory per node.
</p>
<p>
As long as the workload is not network- or memory-bound, we can use 1.6 GB/second per node as the
throughput estimate.
</p>
</section>
<section id="sizing_details">
<title>A More Precise Approach</title>
<p>
A more precise sizing estimate would require not only queries per minute (QPM), but also an average data
size scanned per query (D). With the proper partitioning strategy, D is usually a fraction of the total
data size. The following equation can be used as a rough guide to estimate the number of nodes (N) needed:
</p>
<codeblock>Eq 1: N &gt; QPM * D / 100 GB
</codeblock>
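<p>
The 100 GB constant is the approximate amount of data one node can process per minute, derived from the
1.6 GB/sec per-node throughput estimate above:
</p>
<codeblock>Per-node throughput per minute = 1.6 GB/sec * 60 sec
                               = 96 GB, rounded to 100 GB
</codeblock>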
<p>
Here is an example. Suppose, on average, a query scans 50 GB of data and the average response time is
required to be 15 seconds or less when there are 100 concurrent queries. The QPM is 100 / 15 * 60 = 400. We can
estimate the number of nodes using the equation above.
</p>
<codeblock>N &gt; QPM * D / 100 GB
N &gt; 400 * 50 GB / 100 GB
N &gt; 200
</codeblock>
<p>
Because this figure is a rough estimate, the corresponding number of nodes could be between 100 and 500.
</p>
<p>
Depending on the complexity of the query, the processing rate might vary. If the query has more
joins, aggregation functions, or CPU-intensive functions such as string processing or complex UDFs, the
processing rate will be lower than 1.6 GB/second per node. On the other hand, if the query only does scans and
filtering on numeric columns, the processing rate can be higher.
</p>
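<p>
If you have measured a different per-node processing rate for your workload, you can generalize the
equation by substituting that rate. The sketch below assumes a hypothetical measured rate R of
0.8 GB/sec for a join-heavy workload, reusing the QPM and D values from the earlier example:
</p>
<codeblock>N &gt; QPM * D / (R * 60 sec)
N &gt; 400 * 50 GB / (0.8 GB/sec * 60 sec)
N &gt; 417
</codeblock>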
</section>
<section id="sizing_mem_estimate">
<title>Estimating Memory Requirements</title>
<!--
<prolog>
<metadata>
<data name="Category" value="Memory"/>
</metadata>
</prolog>
-->
<p>
Impala can handle joins between multiple large tables. Make sure that statistics are collected for all the
joined tables, using the <codeph><xref href="impala_compute_stats.xml#compute_stats">COMPUTE
STATS</xref></codeph> statement. However, joining big tables does consume more memory. Follow the steps
below to calculate the minimum memory requirement.
</p>
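<p>
For example, to collect statistics on both tables in the join shown below (the table names
<codeph>a</codeph> and <codeph>b</codeph> are placeholders):
</p>
<codeblock>compute stats a;
compute stats b;
</codeblock>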
<p>
Suppose you are running the following join:
</p>
<codeblock>select a.*, b.col_1, b.col_2, ... b.col_n
from a, b
where a.key = b.key
and b.col_1 in (1,2,4...)
and b.col_4 in (....);
</codeblock>
<p>
And suppose table <codeph>B</codeph> is smaller than table <codeph>A</codeph> (but still a large table).
</p>
<p>
The memory requirement for the query is that the size of the right-hand table (<codeph>B</codeph>), after decompression,
filtering (<codeph>b.col_n in ...</codeph>), and projection (only the columns that are used), must be less
than the total memory of the entire cluster.
</p>
<codeblock>Cluster Total Memory Requirement = Size of the smaller table *
selectivity factor from the predicate *
projection factor * compression ratio
</codeblock>
<p>
In this case, assume that table <codeph>B</codeph> is 100 TB in Parquet format with 200 columns. The
predicate on <codeph>B</codeph> (<codeph>b.col_1 in ... and b.col_4 in ...</codeph>) selects only 10% of
the rows from <codeph>B</codeph>, and the projection uses only 5 columns out of 200. Snappy compression
typically yields about 3x compression, so we estimate a 3x compression factor.
</p>
<codeblock>Cluster Total Memory Requirement = Size of the smaller table *
selectivity factor from the predicate *
projection factor * compression ratio
                                 = 100 TB * 10% * 5/200 * 3
                                 = 0.75 TB
                                 = 750 GB
</codeblock>
<p>
So, if you have a 10-node cluster, each node has 128 GB of RAM, and you give 80% of it to Impala, then you
have approximately 1 TB of usable memory for Impala, which is more than 750 GB. Therefore, your cluster can
handle join queries of this magnitude.
</p>
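<p>
The corresponding arithmetic:
</p>
<codeblock>Usable Impala memory = 10 nodes * 128 GB per node * 80%
                     = 1,024 GB, or about 1 TB
1,024 GB &gt; 750 GB, so the join fits in cluster memory
</codeblock>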
</section>
</conbody>
</concept>