<?xml version="1.0" encoding="UTF-8"?>
<!--
Licensed to the Apache Software Foundation (ASF) under one
or more contributor license agreements. See the NOTICE file
distributed with this work for additional information
regarding copyright ownership. The ASF licenses this file
to you under the Apache License, Version 2.0 (the
"License"); you may not use this file except in compliance
with the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing,
software distributed under the License is distributed on an
"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
KIND, either express or implied. See the License for the
specific language governing permissions and limitations
under the License.
-->
<!DOCTYPE concept PUBLIC "-//OASIS//DTD DITA Concept//EN" "concept.dtd">
<concept id="perf_skew">
<title>Detecting and Correcting HDFS Block Skew Conditions</title>
<titlealts audience="PDF"><navtitle>HDFS Block Skew</navtitle></titlealts>
<prolog>
<metadata>
<data name="Category" value="Impala"/>
<data name="Category" value="Performance"/>
<data name="Category" value="HDFS"/>
<data name="Category" value="Proof of Concept"/>
<data name="Category" value="Administrators"/>
<data name="Category" value="Developers"/>
<data name="Category" value="Data Analysts"/>
</metadata>
</prolog>
<conbody>
<p>
For best performance of Impala parallel queries, the work is divided equally across hosts in the cluster, and
all hosts take approximately equal time to finish their work. If one host takes substantially longer than
others, the extra time needed for the slow host can become the dominant factor in query performance.
Therefore, one of the first steps in performance tuning for Impala is to detect and correct such conditions.
</p>
<p>
The main cause of uneven performance that you can correct within Impala is <term>skew</term> in the number of
HDFS data blocks processed by each host, where some hosts process substantially more data blocks than others.
This condition can occur when the data values themselves are unevenly distributed, for example when
certain data files or partitions are large while others are very small. (Unevenly distributed data does
not necessarily cause problems with the distribution of HDFS blocks, however.) Block skew can also
result from the underlying block allocation policies within HDFS, the replication factor of the data files, and
the way that Impala chooses which host processes each data block.
</p>
<p>
The most convenient way to detect block skew, or slow-host issues in general, is to examine the <q>executive
summary</q> information from the query profile after running a query:
</p>
<ul>
<li>
<p>
In <cmdname>impala-shell</cmdname>, issue the <codeph>SUMMARY</codeph> command immediately after the
query is complete, to see just the summary information. If you detect issues involving skew, you might
switch to issuing the <codeph>PROFILE</codeph> command, which displays the summary information followed
by a detailed performance analysis.
</p>
</li>
<li>
<p>
In the Impala debug web UI, click on the <uicontrol>Profile</uicontrol> link associated with the query after it is
complete. The executive summary information is displayed early in the profile output.
</p>
</li>
</ul>
<p>
For each phase of the query, you see an <uicontrol>Avg Time</uicontrol> and a <uicontrol>Max Time</uicontrol>
value, along with <uicontrol>#Hosts</uicontrol> indicating how many hosts are involved in that query phase.
For all the phases with <uicontrol>#Hosts</uicontrol> greater than one, look for cases where the maximum time
is substantially greater than the average time. Focus on the phases that took the longest, for example, those
taking multiple seconds rather than milliseconds or microseconds.
</p>
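    <p>
      For example, the executive summary for a scan-heavy query might look like the following, where the
      scan phase shows a <uicontrol>Max Time</uicontrol> several times its <uicontrol>Avg Time</uicontrol>,
      a sign that one host processed substantially more data than the others. (The operator names, row
      counts, and timings here are illustrative, not taken from a real cluster.)
    </p>
<codeblock>Operator       #Hosts   Avg Time   Max Time    #Rows  Est. #Rows  Peak Mem
----------------------------------------------------------------------------
03:AGGREGATE       10   145.5ms    152.6ms        10          10    4.0 MB
02:EXCHANGE        10    11.2ms     12.1ms       100         100         0
01:AGGREGATE       10   972.3ms      1.10s       100         100   12.1 MB
00:SCAN HDFS       10      1.8s      7.20s     1.52B       1.52B   48.3 MB</codeblock>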
<p>
If you detect that some hosts take longer than others, first rule out non-Impala causes. Some hosts
might be slower because they have less capacity than the others, or because they are substantially
busier due to unevenly distributed non-Impala workloads:
</p>
<ul>
<li>
<p>
For clusters running Impala, keep the relative capacities of all hosts roughly equal. Any cost savings
from including some underpowered hosts in the cluster will likely be outweighed by poor or uneven
performance, and the time spent diagnosing performance issues.
</p>
</li>
<li>
<p>
If non-Impala workloads cause slowdowns on some hosts but not others, use the appropriate load-balancing
techniques for the non-Impala components to smooth out the load across the cluster.
</p>
</li>
</ul>
<p>
If the hosts on your cluster are evenly powered and evenly loaded, examine the detailed profile output to
determine which host is taking longer than others for the query phase in question. Examine how many bytes are
processed during that phase on that host, how much memory is used, and how many bytes are transmitted across
the network.
</p>
<p>
The most common symptom is a higher number of bytes read on one host than others, due to one host being
requested to process a higher number of HDFS data blocks. This condition is more likely to occur when the
number of blocks accessed by the query is relatively small. For example, if you have a 10-node cluster and
the query processes 10 HDFS blocks, each node might not process exactly one block. If one node sits idle
while another node processes two blocks, the query could take twice as long as if the data were perfectly
distributed.
</p>
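    <p>
      To confirm this kind of skew, look for the per-host scan counters in the detailed
      <codeph>PROFILE</codeph> output. In the following illustrative (not real) fragment, one host reads
      roughly twice as many bytes as its peers because it was assigned two of the query's HDFS blocks;
      the host names are hypothetical:
    </p>
<codeblock>HDFS_SCAN_NODE (id=0)
  Instance (host=worker1:22000): BytesRead: 256.00 MB
  Instance (host=worker2:22000): BytesRead: 256.00 MB
  Instance (host=worker3:22000): BytesRead: 512.00 MB   &lt;-- skewed host</codeblock>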
<p>
Possible solutions in this case include:
</p>
<ul>
<li>
<p>
If the query is artificially small, perhaps for benchmarking purposes, scale it up to process a larger
data set. For example, if some nodes read 10 HDFS data blocks while others read 11, the overall effect of
the uneven distribution is much lower than when some nodes do twice as much work as others. As a
guideline, aim for a <q>sweet spot</q> where each node reads 2 GB or more from HDFS per query. Queries
that process lower volumes than that could experience inconsistent performance that smooths out as
queries become more data-intensive.
</p>
</li>
<li>
<p>
If the query processes only a few large blocks, so that many nodes sit idle and cannot help to
parallelize the query, consider reducing the overall block size. For example, you might adjust the
<codeph>PARQUET_FILE_SIZE</codeph> query option before copying or converting data into a Parquet table.
Or you might adjust the granularity of data files produced earlier in the ETL pipeline by non-Impala
components. In Impala 2.0 and later, the default Parquet block size is 256 MB, reduced from 1 GB, to
improve parallelism for common cluster sizes and data volumes.
</p>
</li>
<li>
<p>
Reduce the amount of compression applied to the data. For text data files, the highest degree of
compression (gzip) produces unsplittable files that are more difficult for Impala to process in parallel
and that require extra memory during processing to hold both the compressed and uncompressed data.
For binary formats such as Parquet and Avro, compression can result in fewer data blocks overall, but
remember that when queries process relatively few blocks, there is less opportunity for parallel
execution and many nodes in the cluster might sit idle. Note that when Impala writes Parquet data with
the query option <codeph>COMPRESSION_CODEC=NONE</codeph> enabled, the data is still typically compact due
to the encoding schemes used by Parquet, independent of the final compression step.
</p>
</li>
</ul>
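    <p>
      As a sketch of the block-size and compression adjustments described above, the following
      <cmdname>impala-shell</cmdname> statements rewrite a table into smaller, uncompressed Parquet
      files. (The table names are hypothetical; choose a <codeph>PARQUET_FILE_SIZE</codeph> value
      appropriate for your data volume and cluster size.)
    </p>
<codeblock>-- Produce 128 MB Parquet files instead of the 256 MB default, so the
-- same data volume spreads across more blocks and therefore more hosts.
SET PARQUET_FILE_SIZE=134217728;

-- Skip the final compression step; Parquet's encoding schemes still
-- keep the files reasonably compact.
SET COMPRESSION_CODEC=NONE;

INSERT OVERWRITE parquet_table SELECT * FROM text_table;</codeblock>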
</conbody>
</concept>