| <?xml version="1.0" encoding="UTF-8"?> |
| <!DOCTYPE concept PUBLIC "-//OASIS//DTD DITA Concept//EN" "concept.dtd"> |
| <!-- |
| Licensed to the Apache Software Foundation (ASF) under one |
| or more contributor license agreements. See the NOTICE file |
| distributed with this work for additional information |
| regarding copyright ownership. The ASF licenses this file |
| to you under the Apache License, Version 2.0 (the |
| "License"); you may not use this file except in compliance |
| with the License. You may obtain a copy of the License at |
| |
| http://www.apache.org/licenses/LICENSE-2.0 |
| |
| Unless required by applicable law or agreed to in writing, |
| software distributed under the License is distributed on an |
| "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY |
| KIND, either express or implied. See the License for the |
| specific language governing permissions and limitations |
| under the License. |
| --> |
| <concept id="scalability"> |
| |
| <title>How to Configure Impala with Dedicated Coordinators</title> |
| |
| <titlealts audience="PDF"> |
| |
| <navtitle>Dedicated Coordinators Optimization</navtitle> |
| |
| </titlealts> |
| |
| <prolog> |
| <metadata> |
| <data name="Category" value="Performance"/> |
| <data name="Category" value="Impala"/> |
| <data name="Category" value="Planning"/> |
| <data name="Category" value="Querying"/> |
| <data name="Category" value="Memory"/> |
| <data name="Category" value="Scalability"/> |
| </metadata> |
| </prolog> |
| |
| <conbody> |
| |
    <p>
      By default, each host that runs the Impala daemon acts as both a coordinator and an
      executor, managing metadata caching, query compilation, and query execution. In this
      configuration, Impala clients can connect to any Impala daemon and send query
      requests.
    </p>
| |
    <p>
      Under highly concurrent workloads with large-scale queries, these dual roles can
      cause scalability issues because:
    </p>
| |
| <ul> |
      <li>
        <p>
          The extra work required for a host to act as the coordinator can interfere with
          its capacity to perform other work in the later phases of a query. For example,
          coordinators can experience significant network and CPU overhead with queries
          containing a large number of query fragments. Each coordinator also caches
          metadata for all table partitions and data files, which requires it to be
          configured with a large JVM heap. Executor-only Impala daemons, by contrast, can
          keep the default JVM heap, leaving more memory available for joins, aggregations,
          and the other operations performed by query executors.
        </p>
      </li>
| |
      <li>
        <p>
          Having a large number of hosts act as coordinators can cause unnecessary network
          overhead, or even timeout errors, as each of those hosts communicates with the
          StateStore daemon for metadata updates.
        </p>
      </li>
| |
      <li>
        <p>
          The "soft limits" imposed by the admission control feature are more likely to be
          exceeded when there are a large number of heavily loaded hosts acting as
          coordinators. See
          <xref href="https://issues.apache.org/jira/browse/IMPALA-3649" format="html"
            scope="external">IMPALA-3649</xref> and
          <xref href="https://issues.apache.org/jira/browse/IMPALA-6437" format="html"
            scope="external">IMPALA-6437</xref> for the status of enhancements that
          mitigate this issue.
        </p>
      </li>
| </ul> |
| |
| <p > |
| The following factors can further exacerbate the above issues: |
| </p> |
| |
| <ul> |
| <li > |
| <p > |
| High number of concurrent query fragments due to query concurrency and/or query |
| complexity |
| </p> |
| </li> |
| |
| <li > |
| <p > |
| Large metadata topic size related to the number of partitions/files/blocks |
| </p> |
| </li> |
| |
| <li > |
| <p > |
| High number of coordinator nodes |
| </p> |
| </li> |
| |
| <li > |
| <p > |
| High number of coordinators used in the same resource pool |
| </p> |
| </li> |
| </ul> |
| |
    <p>
      In Impala 2.9 and higher, you can alleviate such scalability bottlenecks by assigning
      a single dedicated role to each Impala daemon host, either as a coordinator or as an
      executor.
    </p>
| |
| <ul> |
      <li>
        <p>
          All explicit or load-balanced client connections must go to the coordinator
          hosts. These hosts perform the network communication to keep metadata up-to-date
          and route query results to the appropriate clients. The dedicated coordinator
          hosts do not participate in I/O-intensive operations, such as scans, or
          CPU-intensive operations, such as aggregations.
        </p>
      </li>
| |
      <li>
        <p>
          The executor hosts perform the intensive I/O, CPU, and memory operations that
          make up the bulk of the work for each query. The executors still communicate with
          the StateStore daemon for membership status, but the dedicated executor hosts do
          not process the final result sets for queries.
        </p>
      </li>
| </ul> |
| |
| <p > |
| Using dedicated coordinators offers the following benefits: |
| </p> |
| |
| <ul> |
| <li > |
| <p> |
| Reduces memory usage by limiting the number of Impala nodes that need to cache |
| metadata. |
| </p> |
| </li> |
| |
      <li>
        <p>
          Provides better concurrency by avoiding coordinator bottlenecks.
        </p>
      </li>
| |
| <li> |
| <p> |
| Eliminates query over-admission. |
| </p> |
| </li> |
| |
      <li>
        <p>
          Reduces resource utilization, especially network utilization, on the StateStore
          daemon by limiting metadata broadcasts to a subset of nodes.
        </p>
      </li>
| |
      <li>
        <p>
          Improves reliability and performance for highly concurrent workloads by reducing
          workload stress on coordinators. Dedicated coordinators require 50% or fewer
          connections and threads than hosts that act in both roles.
        </p>
      </li>
| |
| <li > |
| <p> |
| Reduces the number of explicit metadata refreshes required. |
| </p> |
| </li> |
| |
      <li>
        <p>
          Improves diagnosability. If a bottleneck or other performance issue arises on a
          specific host, you can narrow down the cause more easily because each host is
          dedicated to specific operations within the overall Impala workload.
        </p>
      </li>
| </ul> |
| |
    <p>
      In this configuration with dedicated coordinators and executors, you cannot connect
      to the dedicated executor hosts through clients such as impala-shell or business
      intelligence tools, because only the coordinator nodes support client connections.
    </p>
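
    <p>
      For example, client sessions must target a coordinator host (or a load balancer in
      front of the coordinators). A minimal sketch, assuming hypothetical host names and
      the default impala-shell port:
    </p>

<codeblock># Connecting to a dedicated coordinator succeeds:
impala-shell -i coordinator-1.example.com:21000

# Connecting to a dedicated executor host fails, because hosts started
# with --is_coordinator=false do not accept client connections:
impala-shell -i executor-5.example.com:21000</codeblock>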
| |
| </conbody> |
| |
| <concept id="concept_vhv_4b1_n2b"> |
| |
| <title>Determining the Optimal Number of Dedicated Coordinators</title> |
| |
| <conbody> |
| |
      <p>
        Use the smallest number of coordinators that still satisfies your workload
        requirements in a cluster. A rough estimate is one coordinator for every 50
        executors.
      </p>
| |
| <p> |
| To maintain a healthy state and optimal performance, it is recommended that you keep the |
| peak utilization of all resources used by Impala, including CPU, the number of threads, |
| the number of connections, and RPCs, under 80%. |
| </p> |
| |
| <p > |
| Consider the following factors to determine the right number of coordinators in your |
| cluster: |
| </p> |
| |
| <ul> |
| <li > |
| <p > |
| What is the number of concurrent queries? |
| </p> |
| </li> |
| |
| <li > |
| <p > |
| What percentage of the workload is DDL? |
| </p> |
| </li> |
| |
| <li > |
| <p> |
| What is the average query resource usage at the various stages (merge, runtime |
| filter, result set size, etc.)? |
| </p> |
| </li> |
| |
        <li>
          <p>
            How many Impala daemons (impalad) are in the cluster?
          </p>
        </li>
| |
| <li > |
| <p> |
| Is there a high availability requirement? |
| </p> |
| </li> |
| |
        <li>
          <p>
            What is the compute/storage capacity reduction factor?
          </p>
        </li>
| </ul> |
| |
      <p>
        Use the following steps to determine the initial number of coordinators:
      </p>
| |
| <ol> |
        <li>
          If your cluster has fewer than 10 nodes, we recommend that you configure one
          dedicated coordinator. Deploy the dedicated coordinator on a DataNode to avoid
          losing storage capacity. In most cases, one dedicated coordinator is enough to
          support all workloads on a cluster.
        </li>
| |
| <li> |
| Add more coordinators if the dedicated coordinator CPU or network peak utilization is |
| 80% or higher. You might need 1 coordinator for every 50 executors. |
| </li> |
| |
        <li>
          If the Impala service is shared by multiple workgroups, each with a dynamic
          resource pool assigned, use one coordinator per pool to avoid over-admission by
          admission control.
        </li>
| |
        <li>
          If high availability is required, double the number of coordinators: one set acts
          as the active set and the other as a backup set.
        </li>
| </ol> |
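
      <p>
        For example, a rough sizing sketch for a hypothetical 100-node cluster with a high
        availability requirement, applying the steps above:
      </p>

<codeblock>100 nodes at 1 coordinator per 50 executors
  => ceil(98 / 50) = 2 coordinators, 98 executors
HA requirement: double the coordinators
  => 4 coordinators (2 active + 2 backup), 96 executors</codeblock>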
| |
| </conbody> |
| |
| <concept id="concept_y4k_rc1_n2b"> |
| |
| <title>Advanced Tuning</title> |
| |
| <conbody> |
| |
      <p>
        Use the following guidelines to further tune throughput and stability.
        <ol>
| <li > |
| The concurrency of DML statements does not typically depend on the number of |
| coordinators or size of the cluster. Queries that return large result sets |
| (10,000+ rows) consume more CPU and memory resources on the coordinator. Add one |
| or two coordinators if the workload has many such queries. |
| </li> |
| |
| <li> |
| DDL queries, excluding <codeph>COMPUTE STATS</codeph> and <codeph>CREATE TABLE AS |
| SELECT</codeph>, are executed only on coordinators. If your workload contains many |
| DDL queries running concurrently, you could add one coordinator. |
| </li> |
| |
          <li>
            CPU contention on coordinators can slow down query execution when concurrency
            is high, especially for very short queries (&lt;10s). Add more coordinators to
            avoid CPU contention.
          </li>
| |
          <li>
            On a large cluster with 50+ nodes, the number of network connections from a
            coordinator to the executors can grow quickly as query complexity increases.
            The growth is much greater on coordinators than on executors. Add a few more
            coordinators to share the load if workloads are complex, that is, if (average
            number of fragments per query * number of impalad instances) &gt; 500, even
            when memory and CPU usage is low. For example, with an average of 10 fragments
            per query on a 60-node cluster, 10 * 60 = 600 &gt; 500. Watch
            <xref href="https://issues.apache.org/jira/browse/IMPALA-4603" format="html"
              scope="external">IMPALA-4603</xref> and
            <xref href="https://issues.apache.org/jira/browse/IMPALA-7213" format="html"
              scope="external">IMPALA-7213</xref> to track progress on fixing this issue.
          </li>
| |
          <li>
            When using multiple coordinators for DML statements, divide the queries into
            groups (number of groups = number of coordinators). Configure a separate
            dynamic resource pool for each group and direct each group of query requests to
            a specific coordinator. This avoids query over-admission.
          </li>
| |
          <li>
            The front-end connection requirement is not a factor in determining the number
            of dedicated coordinators. Consider setting up a connection pool on the client
            side instead of adding coordinators. As a short-term workaround, you can
            increase the value of <codeph>fe_service_threads</codeph> on the coordinators
            to allow more client connections, as shown in the sketch after this list.
          </li>
| |
          <li>
            In general, you should have a very small number of coordinators, so storage
            capacity reduction is not a concern. On a very small cluster (fewer than 10
            nodes), deploy a dedicated coordinator on a DataNode to avoid storage capacity
            reduction.
          </li>
| </ol> |
| </p> |
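
      <p>
        A minimal sketch of the <codeph>fe_service_threads</codeph> workaround mentioned
        above. The flag is real, but the value shown is only an illustrative assumption,
        not a recommendation:
      </p>

<codeblock># In the impalad startup flags on each coordinator, raise the cap on
# concurrent client connections (connection threads) from its default:
impalad --fe_service_threads=256 ...</codeblock>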
| |
| </conbody> |
| |
| </concept> |
| |
| <concept id="concept_w51_cd1_n2b"> |
| |
| <title>Estimating Coordinator Resource Usage</title> |
| |
| <conbody> |
| |
| <p> |
| <table id="table_kh3_d31_n2b" colsep="1" rowsep="1" frame="all"> |
| <tgroup cols="3"> |
| <colspec colnum="1" colname="col1" colsep="1" rowsep="1"/> |
| <colspec colnum="2" colname="col2" colsep="1" rowsep="1"/> |
| <colspec colnum="3" colname="col3" colsep="1" rowsep="1"/> |
| <tbody> |
| <row> |
| <entry > |
| <b>Resource</b> |
| </entry> |
| <entry > |
| <b>Safe range</b> |
| </entry> |
| <entry > |
| <b>Notes / CM tsquery to monitor</b> |
| </entry> |
| </row> |
            <row>
              <entry>
                Memory
              </entry>
              <entry>
                <p>
                  (Max JVM heap setting + query concurrency * query mem_limit) &lt;= 80% of
                  the Impala process memory allocation
                </p>
              </entry>
              <entry>
                <p>
                  <i>Memory usage</i>:
                </p>
                <p>
                  <codeph>SELECT mem_rss WHERE entityName = "Coordinator Instance ID" AND
                  category = ROLE</codeph>
                </p>
                <p>
                  <i>JVM heap usage (metadata cache)</i>:
                </p>
                <p>
                  <codeph>SELECT impala_jvm_heap_current_usage_bytes WHERE entityName =
                  "Coordinator Instance ID" AND category = ROLE</codeph> (only in release
                  5.15 and above)
                </p>
              </entry>
            </row>
            <row>
              <entry>
                TCP connections
              </entry>
              <entry>
                Incoming + outgoing &lt; 16K
              </entry>
              <entry>
                <p>
                  <i>Incoming connection usage</i>:
                </p>
                <p>
                  <codeph>SELECT thrift_server_backend_connections_in_use WHERE entityName
                  = "Coordinator Instance ID" AND category = ROLE</codeph>
                </p>
                <p>
                  <i>Outgoing connection usage</i>:
                </p>
                <p>
                  <codeph>SELECT backends_client_cache_clients_in_use WHERE entityName =
                  "Coordinator Instance ID" AND category = ROLE</codeph>
                </p>
              </entry>
            </row>
            <row>
              <entry>
                Threads
              </entry>
              <entry>
                &lt; 32K
              </entry>
              <entry>
                <codeph>SELECT thread_manager_running_threads WHERE entityName =
                "Coordinator Instance ID" AND category = ROLE</codeph>
              </entry>
            </row>
            <row>
              <entry>
                CPU
              </entry>
              <entry>
                <p>
                  Concurrency = non-DDL query concurrency &lt;= number of virtual cores
                  allocated to Impala per node
                </p>
              </entry>
              <entry>
                <p>
                  Base the CPU usage estimate on the number of cores allocated to Impala
                  per node, not on the sum of all cores in the cluster. Concurrency should
                  not exceed the number of virtual cores allocated to Impala per node.
                </p>
                <p>
                  <i>Query concurrency</i>:
                </p>
                <p>
                  <codeph>SELECT total_impala_num_queries_registered_across_impalads WHERE
                  entityName = "IMPALA-1" AND category = SERVICE</codeph>
                </p>
              </entry>
            </row>
| </tbody> |
| </tgroup> |
| </table> |
| </p> |
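
      <p>
        For example, a worked instance of the memory rule above, using purely illustrative
        numbers (a 25 GB coordinator JVM heap, an expected concurrency of 10 queries, and a
        2 GB per-query mem_limit):
      </p>

<codeblock>25 GB (JVM heap) + 10 (concurrency) * 2 GB (mem_limit) = 45 GB
45 GB / 0.8 = ~57 GB minimum Impala process memory allocation</codeblock>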
| |
| <p> |
| If usage of any of the above resources exceeds the safe range, add one more |
| coordinator. |
| </p> |
| |
| </conbody> |
| |
| </concept> |
| |
| </concept> |
| |
| <concept id="concept_omm_gf1_n2b"> |
| |
| <title>Deploying Dedicated Coordinators and Executors</title> |
| |
| <conbody> |
| |
    <p>
      This section describes how to configure dedicated coordinator and dedicated executor
      roles for Impala.
    </p>
| |
| <ul> |
      <li>
        <p>
          <b>Dedicated coordinators</b>: Deploy them on edge nodes with no other services
          running. They do not need large local disks, but they still need some local disk
          space for spilling. They require a memory allocation at least as large as, or
          larger than, that of the executors.
        </p>
      </li>
| |
      <li>
        <p>
          <b>Dedicated executors</b>: Colocate them with DataNodes as usual. The number of
          hosts with this role typically increases as the cluster grows larger and handles
          more table partitions, data files, and concurrent queries.
        </p>
      </li>
| </ul> |
| |
    <p>
      To configure dedicated coordinators and executors, specify one of the following
      startup flags for the <cmdname>impalad</cmdname> daemon on each host:
| <ul> |
        <li>
          <p>
            <codeph>--is_executor=false</codeph> for each host that does not act as an
            executor for Impala queries. These hosts act exclusively as query coordinators.
            This setting typically applies to a relatively small number of hosts, because
            the most common topology is to have nearly all DataNodes doing work for query
            execution.
          </p>
        </li>
| |
        <li>
          <p>
            <codeph>--is_coordinator=false</codeph> for each host that does not act as a
            coordinator for Impala queries. These hosts act exclusively as executors. The
            number of hosts with this setting typically increases as the cluster grows
            larger and handles more table partitions, data files, and concurrent queries.
            As the overhead for query coordination increases, it becomes more important to
            centralize that work on dedicated hosts.
          </p>
        </li>
| </ul> |
| </p> |
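
    <p>
      A minimal sketch of the resulting startup commands, with hypothetical host roles; in
      a managed deployment, these flags typically go into the impalad startup options
      rather than onto a raw command line:
    </p>

<codeblock># On each dedicated coordinator host:
impalad --is_executor=false ...

# On each dedicated executor host:
impalad --is_coordinator=false ...</codeblock>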
| |
| </conbody> |
| |
| </concept> |
| |
| </concept> |