docs/topics/impala_explain.xml - impala - Git at Google

 <?xml version="1.0" encoding="UTF-8"?>
 <!--
 Licensed to the Apache Software Foundation (ASF) under one
 or more contributor license agreements.  See the NOTICE file
 distributed with this work for additional information
 regarding copyright ownership.  The ASF licenses this file
 to you under the Apache License, Version 2.0 (the
 "License"); you may not use this file except in compliance
 with the License.  You may obtain a copy of the License at

   http://www.apache.org/licenses/LICENSE-2.0

 Unless required by applicable law or agreed to in writing,
 software distributed under the License is distributed on an
 "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
 KIND, either express or implied.  See the License for the
 specific language governing permissions and limitations
 under the License.
 -->
 <!DOCTYPE concept PUBLIC "-//OASIS//DTD DITA Concept//EN" "concept.dtd">
 <concept id="explain">

   <title>EXPLAIN Statement</title>
   <titlealts audience="PDF"><navtitle>EXPLAIN</navtitle></titlealts>
   <prolog>
     <metadata>
       <data name="Category" value="Impala"/>
       <data name="Category" value="SQL"/>
       <data name="Category" value="Querying"/>
       <data name="Category" value="Reports"/>
       <data name="Category" value="Planning"/>
       <data name="Category" value="Performance"/>
       <data name="Category" value="Troubleshooting"/>
       <data name="Category" value="Administrators"/>
       <data name="Category" value="Developers"/>
       <data name="Category" value="Data Analysts"/>
     </metadata>
   </prolog>

   <conbody>

     <p> Returns the execution plan for a statement, showing the low-level
       mechanisms that Impala will use to read the data, divide the work among
       nodes in the cluster, and transmit intermediate and final results across
       the network. Use <codeph>explain</codeph> followed by a complete
         <codeph>SELECT</codeph> query. For example: </p>

     <p conref="../shared/impala_common.xml#common/syntax_blurb"/>

 <codeblock>EXPLAIN { <varname>select_query</varname> | <varname>ctas_stmt</varname> | <varname>insert_stmt</varname> }
 </codeblock>

     <p>
       The <varname>select_query</varname> is a <codeph>SELECT</codeph> statement, optionally prefixed by a
       <codeph>WITH</codeph> clause. See <xref href="impala_select.xml#select"/> for details.
     </p>

     <p>
       The <varname>insert_stmt</varname> is an <codeph>INSERT</codeph> statement that inserts into or overwrites an
       existing table. It can use either the <codeph>INSERT ... SELECT</codeph> or <codeph>INSERT ...
       VALUES</codeph> syntax. See <xref href="impala_insert.xml#insert"/> for details.
     </p>

     <p>
       The <varname>ctas_stmt</varname> is a <codeph>CREATE TABLE</codeph> statement using the <codeph>AS
       SELECT</codeph> clause, typically abbreviated as a <q>CTAS</q> operation. See
       <xref href="impala_create_table.xml#create_table"/> for details.
     </p>

     <p conref="../shared/impala_common.xml#common/usage_notes_blurb"/>

     <p>
       You can interpret the output to judge whether the query is performing efficiently, and adjust the query
       and/or the schema if not. For example, you might change the tests in the <codeph>WHERE</codeph> clause, add
       hints to make join operations more efficient, introduce subqueries, change the order of tables in a join, add
       or change partitioning for a table, collect column statistics and/or table statistics in Hive, or any other
       performance tuning steps.
     </p>

     <p>
       The <codeph>EXPLAIN</codeph> output reminds you if table or column statistics are missing from any table
       involved in the query. These statistics are important for optimizing queries involving large tables or
       multi-table joins. See <xref href="impala_compute_stats.xml#compute_stats"/> for how to gather statistics,
       and <xref href="impala_perf_stats.xml#perf_stats"/> for how to use this information for query tuning.
     </p>

     <p conref="../shared/impala_common.xml#common/explain_interpret"/>

     <p>
       If you come from a traditional database background and are not familiar with data warehousing, keep in mind
       that Impala is optimized for full table scans across very large tables. The structure and distribution of
       this data is typically not suitable for the kind of indexing and single-row lookups that are common in OLTP
       environments. Seeing a query scan entirely through a large table is common, not necessarily an indication of
       an inefficient query. Of course, if you can reduce the volume of scanned data by orders of magnitude, for
       example by using a query that affects only certain partitions within a partitioned table, then you might be
       able to optimize a query so that it executes in seconds rather than minutes.
     </p>
     <p> The <codeph>EXPLAIN</codeph> output becomes more accurate and informative
       as statistics are gathered by the <codeph>COMPUTE STATS</codeph>
       statement. Initially, the information about data size and distribution,
       such as the number of rows or number of distinct values for each column,
       is marked "<codeph>unavailable</codeph>". The <codeph>COMPUTE
         STATS</codeph> statement performs the analysis, so a subsequent
         <codeph>EXPLAIN</codeph> statement has additional information to use in
       deciding how to optimize the distributed query. </p>

     <p>
       For more information and examples to help you interpret <codeph>EXPLAIN</codeph> output, see
       <xref href="impala_explain_plan.xml#perf_explain"/>.
     </p>

     <p rev="1.2">
       <b>Extended EXPLAIN output:</b>
     </p>

     <p rev="1.2"> For performance tuning of complex queries, and capacity
       planning (such as using the admission control and resource management
       features), you can enable more detailed and informative output for the
         <codeph>EXPLAIN</codeph> statement. In the
         <cmdname>impala-shell</cmdname> interpreter, issue the command
         <codeph>SET EXPLAIN_LEVEL=<varname>level</varname></codeph>, where
         <varname>level</varname> is <codeph>MINIMAL</codeph>,
         <codeph>STANDARD</codeph>, <codeph>EXTENDED</codeph>, or
         <codeph>VERBOSE</codeph>. </p>

     <p rev="1.2"> When extended <codeph>EXPLAIN</codeph> output is enabled,
         <codeph>EXPLAIN</codeph> statements print information about estimated
       memory requirements, minimum number of virtual cores, and so on. </p>
     <p>Starting in <keyword keyref="impala32_full"/>, if the
         <codeph>EXPLAIN_LEVEL</codeph> option is set to
         <codeph>EXTENDED</codeph> level or <codeph>VERBOSE</codeph>, the output
       contains the following additional information.</p>
     <ul>
       <li>The analyzed query, in the output header. <p>The analyzed query may
           have been rewritten to include various optimizations and implicit
           casts. See the example below.</p></li>
       <li>The predicates in the plan output includes the same implicit casts and
         literals printed with a cast to show the type.</li>
     </ul>

     <p>
       See <xref href="impala_explain_level.xml#explain_level"/> for details and examples.
     </p>

     <p conref="../shared/impala_common.xml#common/example_blurb"/>

     <p> This example shows how the standard <codeph>EXPLAIN</codeph> output
       moves from the lowest (physical) level to the higher (logical) levels. </p>
     <ol>
       <li>The query begins by scanning a certain amount of data; each node
         performs an aggregation operation (evaluating <codeph>COUNT(*)</codeph>)
         on some subset of data that is local to that node.</li>
       <li>The intermediate results are transmitted back to the coordinator node
         (labelled here as the <codeph>EXCHANGE</codeph> node).</li>
       <li>Lastly, the intermediate results are summed to display the final
         result. </li>
     </ol>

 <codeblock id="explain_plan_simple">[impalad-host:21000] &gt; EXPLAIN SELECT COUNT(*) FROM customer_address;
 +----------------------------------------------------------+
 | Explain String                                           |
 +----------------------------------------------------------+
 | ...                                                      |
 |                                                          |
 | 03:AGGREGATE [FINALIZE]                                  |
 | |  output: sum(count(*))                                 |
 | |                                                        |
 | 02:EXCHANGE [UNPARTITIONED]                              |
 | |                                                        |
 | 01:AGGREGATE                                             |
 | |  output: count(*)                                      |
 | |                                                        |
 | 00:SCAN HDFS [default.customer_address]                  |
 |    partitions=1/1 size=5.25MB                            |
 +----------------------------------------------------------+
 </codeblock>
     <p>The following example shows an extended <codeph>EXPLAIN</codeph> output.
       Note that the analyzed query was rewritten to include:</p>
     <ul>
       <li>The 'constant folding' optimization, which simplified the expression
         in the original query, '<codeph>1000 / 100</codeph>' to
           '<codeph>10</codeph>'.</li>
       <li>The implicit casts in the <codeph>WHERE</codeph> clause.</li>
     </ul>
     <codeblock>EXPLAIN SELECT * FROM functional_kudu.alltypestiny WHERE bigint_col &lt; 1000 / 100;
 +----------------------------------------------------------+
 | Explain String                                           |
 +----------------------------------------------------------+
 | ...
 | Analyzed query: SELECT * FROM mytable WHERE CAST(bigint_col AS DOUBLE) &lt; CAST(10 AS DOUBLE)
 | ...
 | 00:SCAN KUDU [functional_kudu.alltypestiny]
 | predicates: CAST(bigint_col AS DOUBLE) &lt; CAST(10 AS DOUBLE)
 ...</codeblock>

     <p conref="../shared/impala_common.xml#common/security_blurb"/>
     <p conref="../shared/impala_common.xml#common/redaction_yes"/>

     <p conref="../shared/impala_common.xml#common/cancel_blurb_no"/>

     <p conref="../shared/impala_common.xml#common/permissions_blurb"/>
     <p rev="">
       <!-- Doublecheck these details. Does EXPLAIN really need any permissions? -->
       The user ID that the <cmdname>impalad</cmdname> daemon runs under,
       typically the <codeph>impala</codeph> user, must have read
       and execute permissions for all applicable directories in all source tables
       for the query that is being explained.
       (A <codeph>SELECT</codeph> operation could read files from multiple different HDFS directories
       if the source table is partitioned.)
     </p>

     <p rev="kudu" conref="../shared/impala_common.xml#common/kudu_blurb"/>
     <p>
       The <codeph>EXPLAIN</codeph> statement displays equivalent plan
       information for queries against Kudu tables as for queries
       against HDFS-based tables.
     </p>

     <p>
       To see which predicates Impala can <q>push down</q> to Kudu for
       efficient evaluation, without transmitting unnecessary rows back
       to Impala, look for the <codeph>kudu predicates</codeph> item in
       the scan phase of the query. The label <codeph>kudu predicates</codeph>
       indicates a condition that can be evaluated efficiently on the Kudu
       side. The label <codeph>predicates</codeph> in a <codeph>SCAN KUDU</codeph>
       node indicates a condition that is evaluated by Impala.
       For example, in a table with primary key column <codeph>X</codeph>
       and non-primary key column <codeph>Y</codeph>, you can see that
       some operators in the <codeph>WHERE</codeph> clause are evaluated
       immediately by Kudu and others are evaluated later by Impala:
     </p>

 <codeblock rev="2.9.0 IMPALA-4859">
 EXPLAIN SELECT x,y from kudu_table WHERE
   x = 1 AND y NOT IN (2,3) AND z = 1
   AND a IS NOT NULL AND b > 0 AND length(s) > 5;
 +----------------
 | Explain String
 +----------------
 ...
 | 00:SCAN KUDU [kudu_table]
 |    predicates: y NOT IN (2, 3), length(s) > 5
 |    kudu predicates: a IS NOT NULL, b > 0, x = 1, z = 1
 </codeblock>

     <p rev="2.9.0 IMPALA-4859">
       Only binary predicates, <codeph>IS NULL</codeph> and <codeph>IS NOT NULL</codeph>
       (in <keyword keyref="impala29"/> and higher), and <codeph>IN</codeph> predicates
       containing literal values that exactly match the types in the Kudu table, and do not
       require any casting, can be pushed to Kudu.
     </p>

     <p conref="../shared/impala_common.xml#common/related_info"/>
     <p>
       <xref href="impala_select.xml#select"/>,
       <xref href="impala_insert.xml#insert"/>,
       <xref href="impala_create_table.xml#create_table"/>,
       <xref href="impala_explain_plan.xml#explain_plan"/>
     </p>

   </conbody>
 </concept>
	<?xml version="1.0" encoding="UTF-8"?>
	<!--
	Licensed to the Apache Software Foundation (ASF) under one
	or more contributor license agreements. See the NOTICE file
	distributed with this work for additional information
	regarding copyright ownership. The ASF licenses this file
	to you under the Apache License, Version 2.0 (the
	"License"); you may not use this file except in compliance
	with the License. You may obtain a copy of the License at

	http://www.apache.org/licenses/LICENSE-2.0

	Unless required by applicable law or agreed to in writing,
	software distributed under the License is distributed on an
	"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
	KIND, either express or implied. See the License for the
	specific language governing permissions and limitations
	under the License.
	-->
	<!DOCTYPE concept PUBLIC "-//OASIS//DTD DITA Concept//EN" "concept.dtd">
	<concept id="explain">

	<title>EXPLAIN Statement</title>
	<titlealts audience="PDF"><navtitle>EXPLAIN</navtitle></titlealts>
	<prolog>
	<metadata>
	<data name="Category" value="Impala"/>
	<data name="Category" value="SQL"/>
	<data name="Category" value="Querying"/>
	<data name="Category" value="Reports"/>
	<data name="Category" value="Planning"/>
	<data name="Category" value="Performance"/>
	<data name="Category" value="Troubleshooting"/>
	<data name="Category" value="Administrators"/>
	<data name="Category" value="Developers"/>
	<data name="Category" value="Data Analysts"/>
	</metadata>
	</prolog>

	<conbody>

	<p> Returns the execution plan for a statement, showing the low-level
	mechanisms that Impala will use to read the data, divide the work among
	nodes in the cluster, and transmit intermediate and final results across
	the network. Use <codeph>explain</codeph> followed by a complete
	<codeph>SELECT</codeph> query. For example: </p>

	<p conref="../shared/impala_common.xml#common/syntax_blurb"/>

	<codeblock>EXPLAIN { <varname>select_query</varname> \| <varname>ctas_stmt</varname> \| <varname>insert_stmt</varname> }
	</codeblock>

	<p>
	The <varname>select_query</varname> is a <codeph>SELECT</codeph> statement, optionally prefixed by a
	<codeph>WITH</codeph> clause. See <xref href="impala_select.xml#select"/> for details.
	</p>

	<p>
	The <varname>insert_stmt</varname> is an <codeph>INSERT</codeph> statement that inserts into or overwrites an
	existing table. It can use either the <codeph>INSERT ... SELECT</codeph> or <codeph>INSERT ...
	VALUES</codeph> syntax. See <xref href="impala_insert.xml#insert"/> for details.
	</p>

	<p>
	The <varname>ctas_stmt</varname> is a <codeph>CREATE TABLE</codeph> statement using the <codeph>AS
	SELECT</codeph> clause, typically abbreviated as a <q>CTAS</q> operation. See
	<xref href="impala_create_table.xml#create_table"/> for details.
	</p>

	<p conref="../shared/impala_common.xml#common/usage_notes_blurb"/>

	<p>
	You can interpret the output to judge whether the query is performing efficiently, and adjust the query
	and/or the schema if not. For example, you might change the tests in the <codeph>WHERE</codeph> clause, add
	hints to make join operations more efficient, introduce subqueries, change the order of tables in a join, add
	or change partitioning for a table, collect column statistics and/or table statistics in Hive, or any other
	performance tuning steps.
	</p>

	<p>
	The <codeph>EXPLAIN</codeph> output reminds you if table or column statistics are missing from any table
	involved in the query. These statistics are important for optimizing queries involving large tables or
	multi-table joins. See <xref href="impala_compute_stats.xml#compute_stats"/> for how to gather statistics,
	and <xref href="impala_perf_stats.xml#perf_stats"/> for how to use this information for query tuning.
	</p>

	<p conref="../shared/impala_common.xml#common/explain_interpret"/>

	<p>
	If you come from a traditional database background and are not familiar with data warehousing, keep in mind
	that Impala is optimized for full table scans across very large tables. The structure and distribution of
	this data is typically not suitable for the kind of indexing and single-row lookups that are common in OLTP
	environments. Seeing a query scan entirely through a large table is common, not necessarily an indication of
	an inefficient query. Of course, if you can reduce the volume of scanned data by orders of magnitude, for
	example by using a query that affects only certain partitions within a partitioned table, then you might be
	able to optimize a query so that it executes in seconds rather than minutes.
	</p>
	<p> The <codeph>EXPLAIN</codeph> output becomes more accurate and informative
	as statistics are gathered by the <codeph>COMPUTE STATS</codeph>
	statement. Initially, the information about data size and distribution,
	such as the number of rows or number of distinct values for each column,
	is marked "<codeph>unavailable</codeph>". The <codeph>COMPUTE
	STATS</codeph> statement performs the analysis, so a subsequent
	<codeph>EXPLAIN</codeph> statement has additional information to use in
	deciding how to optimize the distributed query. </p>

	<p>
	For more information and examples to help you interpret <codeph>EXPLAIN</codeph> output, see
	<xref href="impala_explain_plan.xml#perf_explain"/>.
	</p>

	<p rev="1.2">
	<b>Extended EXPLAIN output:</b>
	</p>

	<p rev="1.2"> For performance tuning of complex queries, and capacity
	planning (such as using the admission control and resource management
	features), you can enable more detailed and informative output for the
	<codeph>EXPLAIN</codeph> statement. In the
	<cmdname>impala-shell</cmdname> interpreter, issue the command
	<codeph>SET EXPLAIN_LEVEL=<varname>level</varname></codeph>, where
	<varname>level</varname> is <codeph>MINIMAL</codeph>,
	<codeph>STANDARD</codeph>, <codeph>EXTENDED</codeph>, or
	<codeph>VERBOSE</codeph>. </p>

	<p rev="1.2"> When extended <codeph>EXPLAIN</codeph> output is enabled,
	<codeph>EXPLAIN</codeph> statements print information about estimated
	memory requirements, minimum number of virtual cores, and so on. </p>
	<p>Starting in <keyword keyref="impala32_full"/>, if the
	<codeph>EXPLAIN_LEVEL</codeph> option is set to
	<codeph>EXTENDED</codeph> level or <codeph>VERBOSE</codeph>, the output
	contains the following additional information.</p>
	<ul>
	<li>The analyzed query, in the output header. <p>The analyzed query may
	have been rewritten to include various optimizations and implicit
	casts. See the example below.</p></li>
	<li>The predicates in the plan output includes the same implicit casts and
	literals printed with a cast to show the type.</li>
	</ul>

	<p>
	See <xref href="impala_explain_level.xml#explain_level"/> for details and examples.
	</p>

	<p conref="../shared/impala_common.xml#common/example_blurb"/>

	<p> This example shows how the standard <codeph>EXPLAIN</codeph> output
	moves from the lowest (physical) level to the higher (logical) levels. </p>
	<ol>
	<li>The query begins by scanning a certain amount of data; each node
	performs an aggregation operation (evaluating <codeph>COUNT(*)</codeph>)
	on some subset of data that is local to that node.</li>
	<li>The intermediate results are transmitted back to the coordinator node
	(labelled here as the <codeph>EXCHANGE</codeph> node).</li>
	<li>Lastly, the intermediate results are summed to display the final
	result. </li>
	</ol>

	<codeblock id="explain_plan_simple">[impalad-host:21000] > EXPLAIN SELECT COUNT(*) FROM customer_address;
	+----------------------------------------------------------+
	\| Explain String \|
	+----------------------------------------------------------+
	\| ... \|
	\| \|
	\| 03:AGGREGATE [FINALIZE] \|
	\| \| output: sum(count(*)) \|
	\| \| \|
	\| 02:EXCHANGE [UNPARTITIONED] \|
	\| \| \|
	\| 01:AGGREGATE \|
	\| \| output: count(*) \|
	\| \| \|
	\| 00:SCAN HDFS [default.customer_address] \|
	\| partitions=1/1 size=5.25MB \|
	+----------------------------------------------------------+
	</codeblock>
	<p>The following example shows an extended <codeph>EXPLAIN</codeph> output.
	Note that the analyzed query was rewritten to include:</p>
	<ul>
	<li>The 'constant folding' optimization, which simplified the expression
	in the original query, '<codeph>1000 / 100</codeph>' to
	'<codeph>10</codeph>'.</li>
	<li>The implicit casts in the <codeph>WHERE</codeph> clause.</li>
	</ul>
	<codeblock>EXPLAIN SELECT * FROM functional_kudu.alltypestiny WHERE bigint_col < 1000 / 100;
	+----------------------------------------------------------+
	\| Explain String \|
	+----------------------------------------------------------+
	\| ...
	\| Analyzed query: SELECT * FROM mytable WHERE CAST(bigint_col AS DOUBLE) < CAST(10 AS DOUBLE)
	\| ...
	\| 00:SCAN KUDU [functional_kudu.alltypestiny]
	\| predicates: CAST(bigint_col AS DOUBLE) < CAST(10 AS DOUBLE)
	...</codeblock>

	<p conref="../shared/impala_common.xml#common/security_blurb"/>
	<p conref="../shared/impala_common.xml#common/redaction_yes"/>

	<p conref="../shared/impala_common.xml#common/cancel_blurb_no"/>

	<p conref="../shared/impala_common.xml#common/permissions_blurb"/>
	<p rev="">
	<!-- Doublecheck these details. Does EXPLAIN really need any permissions? -->
	The user ID that the <cmdname>impalad</cmdname> daemon runs under,
	typically the <codeph>impala</codeph> user, must have read
	and execute permissions for all applicable directories in all source tables
	for the query that is being explained.
	(A <codeph>SELECT</codeph> operation could read files from multiple different HDFS directories
	if the source table is partitioned.)
	</p>

	<p rev="kudu" conref="../shared/impala_common.xml#common/kudu_blurb"/>
	<p>
	The <codeph>EXPLAIN</codeph> statement displays equivalent plan
	information for queries against Kudu tables as for queries
	against HDFS-based tables.
	</p>

	<p>
	To see which predicates Impala can <q>push down</q> to Kudu for
	efficient evaluation, without transmitting unnecessary rows back
	to Impala, look for the <codeph>kudu predicates</codeph> item in
	the scan phase of the query. The label <codeph>kudu predicates</codeph>
	indicates a condition that can be evaluated efficiently on the Kudu
	side. The label <codeph>predicates</codeph> in a <codeph>SCAN KUDU</codeph>
	node indicates a condition that is evaluated by Impala.
	For example, in a table with primary key column <codeph>X</codeph>
	and non-primary key column <codeph>Y</codeph>, you can see that
	some operators in the <codeph>WHERE</codeph> clause are evaluated
	immediately by Kudu and others are evaluated later by Impala:
	</p>

	<codeblock rev="2.9.0 IMPALA-4859">
	EXPLAIN SELECT x,y from kudu_table WHERE
	x = 1 AND y NOT IN (2,3) AND z = 1
	AND a IS NOT NULL AND b > 0 AND length(s) > 5;
	+----------------
	\| Explain String
	+----------------
	...
	\| 00:SCAN KUDU [kudu_table]
	\| predicates: y NOT IN (2, 3), length(s) > 5
	\| kudu predicates: a IS NOT NULL, b > 0, x = 1, z = 1
	</codeblock>

	<p rev="2.9.0 IMPALA-4859">
	Only binary predicates, <codeph>IS NULL</codeph> and <codeph>IS NOT NULL</codeph>
	(in <keyword keyref="impala29"/> and higher), and <codeph>IN</codeph> predicates
	containing literal values that exactly match the types in the Kudu table, and do not
	require any casting, can be pushed to Kudu.
	</p>

	<p conref="../shared/impala_common.xml#common/related_info"/>
	<p>
	<xref href="impala_select.xml#select"/>,
	<xref href="impala_insert.xml#insert"/>,
	<xref href="impala_create_table.xml#create_table"/>,
	<xref href="impala_explain_plan.xml#explain_plan"/>
	</p>

	</conbody>
	</concept>