blob: 0b196aa1b5763f340bb5ab5cba4b678326dc84f5 [file] [log] [blame]
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE html
PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">
<head><meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />
<meta name="copyright" content="(C) Copyright 2023" />
<meta name="DC.rights.owner" content="(C) Copyright 2023" />
<meta name="DC.Type" content="concept" />
<meta name="DC.Title" content="OPTIMIZE_PARTITION_KEY_SCANS Query Option (Impala 2.5 or higher only)" />
<meta name="DC.Relation" scheme="URI" content="../topics/impala_set.html" />
<meta name="prodname" content="Impala" />
<meta name="prodname" content="Impala" />
<meta name="version" content="Impala 3.4.x" />
<meta name="version" content="Impala 3.4.x" />
<meta name="DC.Format" content="XHTML" />
<meta name="DC.Identifier" content="optimize_partition_key_scans" />
<link rel="stylesheet" type="text/css" href="../commonltr.css" />
<title>OPTIMIZE_PARTITION_KEY_SCANS Query Option (Impala 2.5 or higher only)</title>
</head>
<body id="optimize_partition_key_scans">
<h1 class="title topictitle1" id="ariaid-title1">OPTIMIZE_PARTITION_KEY_SCANS Query Option (<span class="keyword">Impala 2.5</span> or higher only)</h1>
<div class="body conbody">
<p class="p">
Enables a fast code path for queries that apply simple aggregate functions to partition key
columns: <code class="ph codeph">MIN(<var class="keyword varname">key_column</var>)</code>, <code class="ph codeph">MAX(<var class="keyword varname">key_column</var>)</code>,
or <code class="ph codeph">COUNT(DISTINCT <var class="keyword varname">key_column</var>)</code>.
</p>
<p class="p">
<strong class="ph b">Type:</strong> Boolean; recognized values are 1 and 0, or <code class="ph codeph">true</code> and
<code class="ph codeph">false</code>; any other value interpreted as <code class="ph codeph">false</code>
</p>
<p class="p">
<strong class="ph b">Default:</strong> <code class="ph codeph">false</code> (shown as 0 in output of <code class="ph codeph">SET</code>
statement)
</p>
<div class="note note"><span class="notetitle">Note:</span>
In <span class="keyword">Impala 2.5.0</span>, only the value 1 enables the option, and the value
<code class="ph codeph">true</code> is not recognized. This limitation is tracked by the issue
<a class="xref" href="https://issues.apache.org/jira/browse/IMPALA-3334" target="_blank">IMPALA-3334</a>, which shows the releases where the
problem is fixed.
</div>
<p class="p">
<strong class="ph b">Added in:</strong> <span class="keyword">Impala 2.5.0</span>
</p>
<p class="p">
<strong class="ph b">Usage notes:</strong>
</p>
<p class="p">
This optimization speeds up common <span class="q">"introspection"</span> operations
over partition key columns, for example determining the distinct values
of partition keys.
</p>
<div class="p">
This optimization does not apply to <code class="ph codeph">SELECT</code> statements
that reference columns that are not partition keys. It also only applies
when all the partition key columns in the <code class="ph codeph">SELECT</code> statement
are referenced in one of the following contexts:
<ul class="ul">
<li class="li">
<p class="p">
Within a <code class="ph codeph">MAX()</code> or <code class="ph codeph">MAX()</code>
aggregate function or as the argument of any aggregate function with
the <code class="ph codeph">DISTINCT</code> keyword applied.
</p>
</li>
<li class="li">
<p class="p">
Within a <code class="ph codeph">WHERE</code>, <code class="ph codeph">GROUP BY</code>
or <code class="ph codeph">HAVING</code> clause.
</p>
</li>
</ul>
</div>
<p class="p">
This optimization is enabled by a query option because it skips some consistency checks
and therefore can return slightly different partition values if partitions are in the
process of being added, dropped, or loaded outside of Impala. Queries might exhibit different
behavior depending on the setting of this option in the following cases:
</p>
<ul class="ul">
<li class="li">
<p class="p">
If files are removed from a partition using HDFS or other non-Impala operations,
there is a period until the next <code class="ph codeph">REFRESH</code> of the table where regular
queries fail at run time because they detect the missing files. With this optimization
enabled, queries that evaluate only the partition key column values (not the contents of
the partition itself) succeed, and treat the partition as if it still exists.
</p>
</li>
<li class="li">
<p class="p">
If a partition contains any data files, but the data files do not contain any rows,
a regular query considers that the partition does not exist. With this optimization
enabled, the partition is treated as if it exists.
</p>
<p class="p">
If the partition includes no files at all, this optimization does not change the query
behavior: the partition is considered to not exist whether or not this optimization is enabled.
</p>
</li>
</ul>
<p class="p">
<strong class="ph b">Examples:</strong>
</p>
<p class="p">
The following example shows initial schema setup and the default behavior of queries that
return just the partition key column for a table:
</p>
<pre class="pre codeblock"><code>
-- Make a partitioned table with 3 partitions.
create table t1 (s string) partitioned by (year int);
insert into t1 partition (year=2015) values ('last year');
insert into t1 partition (year=2016) values ('this year');
insert into t1 partition (year=2017) values ('next year');
-- Regardless of the option setting, this query must read the
-- data files to know how many rows to return for each year value.
explain select year from t1;
+-----------------------------------------------------+
| Explain String |
+-----------------------------------------------------+
| Estimated Per-Host Requirements: Memory=0B VCores=0 |
| |
| F00:PLAN FRAGMENT [UNPARTITIONED] |
| 00:SCAN HDFS [key_cols.t1] |
| partitions=3/3 files=4 size=40B |
| table stats: 3 rows total |
| column stats: all |
| hosts=3 per-host-mem=unavailable |
| tuple-ids=0 row-size=4B cardinality=3 |
+-----------------------------------------------------+
-- The aggregation operation means the query does not need to read
-- the data within each partition: the result set contains exactly 1 row
-- per partition, derived from the partition key column value.
-- By default, Impala still includes a 'scan' operation in the query.
explain select distinct year from t1;
+------------------------------------------------------------------------------------+
| Explain String |
+------------------------------------------------------------------------------------+
| Estimated Per-Host Requirements: Memory=0B VCores=0 |
| |
| 01:AGGREGATE [FINALIZE] |
| | group by: year |
| | |
| 00:SCAN HDFS [key_cols.t1] |
| partitions=0/0 files=0 size=0B |
+------------------------------------------------------------------------------------+
</code></pre>
<p class="p">
The following examples show how the plan is made more efficient when the
<code class="ph codeph">OPTIMIZE_PARTITION_KEY_SCANS</code> option is enabled:
</p>
<pre class="pre codeblock"><code>
set optimize_partition_key_scans=1;
OPTIMIZE_PARTITION_KEY_SCANS set to 1
-- The aggregation operation is turned into a UNION internally,
-- with constant values known in advance based on the metadata
-- for the partitioned table.
explain select distinct year from t1;
+-----------------------------------------------------+
| Explain String |
+-----------------------------------------------------+
| Estimated Per-Host Requirements: Memory=0B VCores=0 |
| |
| F00:PLAN FRAGMENT [UNPARTITIONED] |
| 01:AGGREGATE [FINALIZE] |
| | group by: year |
| | hosts=1 per-host-mem=unavailable |
| | tuple-ids=1 row-size=4B cardinality=3 |
| | |
| 00:UNION |
| constant-operands=3 |
| hosts=1 per-host-mem=unavailable |
| tuple-ids=0 row-size=4B cardinality=3 |
+-----------------------------------------------------+
-- The same optimization applies to other aggregation queries
-- that only return values based on partition key columns:
-- MIN, MAX, COUNT(DISTINCT), and so on.
explain select min(year) from t1;
+-----------------------------------------------------+
| Explain String |
+-----------------------------------------------------+
| Estimated Per-Host Requirements: Memory=0B VCores=0 |
| |
| F00:PLAN FRAGMENT [UNPARTITIONED] |
| 01:AGGREGATE [FINALIZE] |
| | output: min(year) |
| | hosts=1 per-host-mem=unavailable |
| | tuple-ids=1 row-size=4B cardinality=1 |
| | |
| 00:UNION |
| constant-operands=3 |
| hosts=1 per-host-mem=unavailable |
| tuple-ids=0 row-size=4B cardinality=3 |
+-----------------------------------------------------+
</code></pre>
</div>
<div class="related-links">
<div class="familylinks">
<div class="parentlink"><strong>Parent topic:</strong> <a class="link" href="../topics/impala_set.html">SET Statement</a></div>
</div>
</div></body>
</html>