<?xml version="1.0" encoding="UTF-8"?>
<!--
Licensed to the Apache Software Foundation (ASF) under one
or more contributor license agreements. See the NOTICE file
distributed with this work for additional information
regarding copyright ownership. The ASF licenses this file
to you under the Apache License, Version 2.0 (the
"License"); you may not use this file except in compliance
with the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing,
software distributed under the License is distributed on an
"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
KIND, either express or implied. See the License for the
specific language governing permissions and limitations
under the License.
-->
<!DOCTYPE concept PUBLIC "-//OASIS//DTD DITA Concept//EN" "concept.dtd">
<concept id="runtime_filtering" rev="2.5.0">
<title id="runtime_filters">Runtime Filtering for Impala Queries (<keyword keyref="impala25"/> or higher only)</title>
<titlealts audience="PDF"><navtitle>Runtime Filtering</navtitle></titlealts>
<prolog>
<metadata>
<data name="Category" value="Impala"/>
<data name="Category" value="SQL"/>
<data name="Category" value="Querying"/>
<data name="Category" value="Performance"/>
<data name="Category" value="Developers"/>
<data name="Category" value="Data Analysts"/>
</metadata>
</prolog>
<conbody>
<p rev="2.5.0">
<indexterm audience="hidden">runtime filtering</indexterm>
      <term>Runtime filtering</term> is a wide-ranging optimization feature available in
      <keyword keyref="impala25_full"/> and higher. When only a fraction of the data in a table is
      needed for a query against a partitioned table or to evaluate a join condition,
      Impala determines the appropriate conditions while the query is running, and
      broadcasts that information to all the <cmdname>impalad</cmdname> nodes that are reading the table.
      Those nodes can then avoid unnecessary I/O by skipping irrelevant partition data, and avoid
      unnecessary network transmission by sending only the subset of rows that match the join keys
      across the network.
</p>
<p>
This feature is primarily used to optimize queries against large partitioned tables
(under the name <term>dynamic partition pruning</term>) and joins of large tables.
The information in this section includes concepts, internals, and troubleshooting
information for the entire runtime filtering feature.
For specific tuning steps for partitioned tables,
<!-- and join queries, -->
see
<xref href="impala_partitioning.xml#dynamic_partition_pruning"/>.
<!-- and <xref href="impala_joins.xml#joins"/>. -->
</p>
<note type="important" rev="2.6.0">
<p rev="2.6.0">
When this feature made its debut in <keyword keyref="impala25"/>,
the default setting was <codeph>RUNTIME_FILTER_MODE=LOCAL</codeph>.
      In <keyword keyref="impala26_full"/> and higher, the default is <codeph>RUNTIME_FILTER_MODE=GLOBAL</codeph>,
      which enables more wide-ranging and ambitious query optimization without requiring you to
      explicitly set any query options.
</p>
</note>
<p outputclass="toc inpage"/>
</conbody>
<concept id="runtime_filtering_concepts">
<title>Background Information for Runtime Filtering</title>
<conbody>
<p>
To understand how runtime filtering works at a detailed level, you must
be familiar with some terminology from the field of distributed database technology:
</p>
<ul>
<li>
<p> What a <term>plan fragment</term> is. Impala decomposes each query
into smaller units of work that are distributed across the cluster.
Wherever possible, a data block is read, filtered, and aggregated by
plan fragments executing on the same host. For some operations, such
as joins and combining intermediate results into a final result set,
data is transmitted across the network from one Impala daemon to
another. </p>
</li>
<li>
<p>
What <codeph>SCAN</codeph> and <codeph>HASH JOIN</codeph> plan nodes are, and their role in computing query results:
</p>
<p>
In the Impala query plan, a <term>scan node</term> performs the I/O to read from the underlying data files.
Although this is an expensive operation from the traditional database perspective, Hadoop clusters and Impala are
optimized to do this kind of I/O in a highly parallel fashion. The major potential cost savings come from using
the columnar Parquet format (where Impala can avoid reading data for unneeded columns) and partitioned tables
(where Impala can avoid reading data for unneeded partitions).
</p>
<p>
Most Impala joins use the
<xref href="https://en.wikipedia.org/wiki/Hash_join" scope="external" format="html"><term>hash join</term></xref>
          mechanism. (Impala started using the nested-loop join technique relatively recently,
          for certain kinds of non-equijoin queries.)
In a hash join, when evaluating join conditions from two tables, Impala constructs a hash table in memory with all
the different column values from the table on one side of the join.
Then, for each row from the table on the other side of the join, Impala tests whether the relevant column values
are in this hash table or not.
</p>
<p>
A <term>hash join node</term> constructs such an in-memory hash table, then performs the comparisons to
identify which rows match the relevant join conditions
and should be included in the result set (or at least sent on to the subsequent intermediate stage of
query processing). Because some of the input for a hash join might be transmitted across the network from another host,
it is especially important from a performance perspective to prune out ahead of time any data that is known to be
irrelevant.
</p>
<p>
          The more distinct values there are in the columns used as join keys, the larger the in-memory hash table,
          and thus the more memory is required to process the query.
</p>
</li>
<li>
<p>
The difference between a <term>broadcast join</term> and a <term>shuffle join</term>.
(The Hadoop notion of a shuffle join is sometimes referred to in Impala as a <term>partitioned join</term>.)
In a broadcast join, the table from one side of the join (typically the smaller table)
is sent in its entirety to all the hosts involved in the query. Then each host can compare its
portion of the data from the other (larger) table against the full set of possible join keys.
In a shuffle join, there is no obvious <q>smaller</q> table, and so the contents of both tables
are divided up, and corresponding portions of the data are transmitted to each host involved in the query.
See <xref href="impala_hints.xml#hints"/> for information about how these different kinds of
joins are processed.
</p>
</li>
<li>
<p>
The notion of the build phase and probe phase when Impala processes a join query.
The <term>build phase</term> is where the rows containing the join key columns, typically for the smaller table,
are transmitted across the network and built into an in-memory hash table data structure on one or
more destination nodes.
The <term>probe phase</term> is where data is read locally (typically from the larger table) and the join key columns
are compared to the values in the in-memory hash table.
The corresponding input sources (tables, subqueries, and so on) for these
phases are referred to as the <term>build side</term> and the <term>probe side</term>.
</p>
</li>
<li>
<p>
            How to set Impala query options: interactively within an <cmdname>impala-shell</cmdname> session through
            the <codeph>SET</codeph> command, for a JDBC or ODBC application through the <codeph>SET</codeph> statement, or
            globally for all <cmdname>impalad</cmdname> daemons through the <codeph>default_query_options</codeph> configuration
            setting. (A brief example follows this list.)
</p>
</li>
</ul>
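      <p>
        For example, here is a minimal sketch of setting runtime filtering options interactively
        within an <cmdname>impala-shell</cmdname> session. The option values are illustrative,
        not recommendations:
      </p>
<codeblock>-- Inside an impala-shell session. Values shown are illustrative.
SET RUNTIME_FILTER_MODE=GLOBAL;        -- the default in Impala 2.6 and higher
SET RUNTIME_FILTER_WAIT_TIME_MS=2000;  -- wait up to 2 seconds for filters to arrive
SELECT COUNT(*) FROM huge_t1 JOIN tiny_t2 ON huge_t1.id = tiny_t2.id;
</codeblock>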
</conbody>
</concept>
<concept id="runtime_filtering_internals">
<title>Runtime Filtering Internals</title>
<conbody>
<p>
The <term>filter</term> that is transmitted between plan fragments is essentially a list
of values for join key columns. When this list of values is transmitted in time to a scan node,
Impala can filter out non-matching values immediately after reading them, rather than transmitting
the raw data to another host to compare against the in-memory hash table on that host.
</p>
<p>
        For HDFS-based tables, this data structure is implemented as a <term>Bloom filter</term>, which uses
        a probability-based algorithm to compactly represent the set of possible matching values. (The probability-based
        aspect means that the filter might let through some non-matching values, but it never discards a matching one;
        any such false positives are eliminated later by the exact join comparison, so they do not cause any inaccuracy
        in the final results.)
</p>
<p rev="2.11.0 IMPALA-4252">
        Another kind of filter is the <q>min-max</q> filter. It currently applies only to Kudu tables. The
filter is a data structure representing a minimum and maximum value. These filters are passed to
Kudu to reduce the number of rows returned to Impala when scanning the probe side of the join.
</p>
<p>
There are different kinds of filters to match the different kinds of joins (partitioned and broadcast).
A broadcast filter reflects the complete list of relevant values and can be immediately evaluated by a scan node.
A partitioned filter reflects only the values processed by one host in the
cluster; all the partitioned filters must be combined into one (by the coordinator node) before the
scan nodes can use the results to accurately filter the data as it is read from storage.
</p>
<p>
Broadcast filters are also classified as local or global. With a local broadcast filter, the information
in the filter is used by a subsequent query fragment that is running on the same host that produced the filter.
A non-local broadcast filter must be transmitted across the network to a query fragment that is running on a
        different host. Impala designates three hosts to each produce non-local broadcast filters, to guard against the
possibility of a single slow host taking too long. Depending on the setting of the <codeph>RUNTIME_FILTER_MODE</codeph> query option
(<codeph>LOCAL</codeph> or <codeph>GLOBAL</codeph>), Impala either uses a conservative optimization
strategy where filters are only consumed on the same host that produced them, or a more aggressive strategy
where filters are eligible to be transmitted across the network.
</p>
<note rev="2.6.0 IMPALA-3333">
In <keyword keyref="impala26_full"/> and higher, the default for runtime filtering is the <codeph>GLOBAL</codeph> setting.
</note>
</conbody>
</concept>
<concept id="runtime_filtering_file_formats">
<title>File Format Considerations for Runtime Filtering</title>
<conbody>
<p>
Parquet tables get the most benefit from
the runtime filtering optimizations. Runtime filtering can speed up
join queries against partitioned or unpartitioned Parquet tables,
and single-table queries against partitioned Parquet tables.
See <xref href="impala_parquet.xml#parquet"/> for information about
using Parquet tables with Impala.
</p>
<p>
For other file formats (text, Avro, RCFile, and SequenceFile),
runtime filtering speeds up queries against partitioned tables only.
Because partitioned tables can use a mixture of formats, Impala produces
the filters in all cases, even if they are not ultimately used to
optimize the query.
</p>
</conbody>
</concept>
<concept id="runtime_filtering_timing">
<title>Wait Intervals for Runtime Filters</title>
<conbody>
<p>
        Because it takes time to produce runtime filters, especially for
        partitioned filters that must be combined by the coordinator node,
        there is a point beyond which it is more efficient for
        the scan nodes to go ahead and construct their intermediate result sets,
        even though that intermediate data is larger than optimal. If producing
        the filters takes only a few seconds, the extra wait is worthwhile when pruning
        the unnecessary data can save minutes in the overall query time.
        You can specify the maximum wait time in milliseconds using the
        <codeph>RUNTIME_FILTER_WAIT_TIME_MS</codeph> query option.
</p>
<p>
By default, each scan node waits for up to 1 second (1000 milliseconds)
for filters to arrive. If all filters have not arrived within the
specified interval, the scan node proceeds, using whatever filters
did arrive to help avoid reading unnecessary data. If a filter arrives
after the scan node begins reading data, the scan node applies that
filter to the data that is read after the filter arrives, but not to
the data that was already read.
</p>
<p>
If the cluster is relatively busy and your workload contains many
resource-intensive or long-running queries, consider increasing the wait time
so that complicated queries do not miss opportunities for optimization.
If the cluster is lightly loaded and your workload contains many small queries
taking only a few seconds, consider decreasing the wait time to avoid the
        1-second delay for each query.
</p>
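    <p>
      For example, the following sketch raises the wait interval for one complicated join query
      on a busy cluster (the value shown is illustrative):
    </p>
<codeblock>SET RUNTIME_FILTER_WAIT_TIME_MS=5000;  -- wait up to 5 seconds rather than the default 1 second
SELECT c1, c2, c3 FROM huge_t1 JOIN huge_t2 ON huge_t1.id = huge_t2.id;
</codeblock>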
</conbody>
</concept>
<concept id="runtime_filtering_query_options">
<title>Query Options for Runtime Filtering</title>
<conbody>
<p>
See the following sections for information about the query options that control runtime filtering:
</p>
<ul>
<li>
<p>
The first query option adjusts the <q>sensitivity</q> of this feature.
<ph rev="2.6.0 IMPALA-3333">By default, it is set to the highest level (<codeph>GLOBAL</codeph>).
(This default applies to <keyword keyref="impala26_full"/> and higher.
In previous releases, the default was <codeph>LOCAL</codeph>.)</ph>
</p>
<ul>
<li>
<p>
<xref href="impala_runtime_filter_mode.xml#runtime_filter_mode"/>
</p>
</li>
</ul>
</li>
<li>
<p>
            The other query options are tuning knobs that you typically adjust only after doing
            performance testing, and that you might want to change only for the duration of a single
            expensive query (see the sketch following this list):
</p>
<ul>
<li>
<p>
<xref href="impala_max_num_runtime_filters.xml#max_num_runtime_filters"/>
</p>
</li>
<li>
<p>
<xref href="impala_disable_row_runtime_filtering.xml#disable_row_runtime_filtering"/>
</p>
</li>
<li>
<p rev="2.6.0 IMPALA-3480">
<xref href="impala_runtime_filter_max_size.xml#runtime_filter_max_size"/>
</p>
</li>
<li>
<p rev="2.6.0 IMPALA-3480">
<xref href="impala_runtime_filter_min_size.xml#runtime_filter_min_size"/>
</p>
</li>
<li>
<p rev="2.6.0 IMPALA-3007">
<xref href="impala_runtime_bloom_filter_size.xml#runtime_bloom_filter_size"/>;
in <keyword keyref="impala26_full"/> and higher, this setting acts as a fallback when
statistics are not available, rather than as a directive.
</p>
</li>
</ul>
</li>
</ul>
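    <p>
      For example, you might change one of these tuning knobs just for a single expensive query,
      as in the following sketch (the option value is illustrative, not a recommendation):
    </p>
<codeblock>SET RUNTIME_FILTER_MAX_SIZE=8388608;  -- illustrative 8 MB cap on each filter
SELECT c1, c2, c3 FROM huge_t1 JOIN huge_t2 ON huge_t1.id = huge_t2.id;
-- Reset the option, or end the session, before running unrelated queries.
</codeblock>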
</conbody>
</concept>
<concept id="runtime_filtering_explain_plan">
<title>Runtime Filtering and Query Plans</title>
<conbody>
<p>
        In the same way that the query plan displayed by the
<codeph>EXPLAIN</codeph> statement includes information
about predicates used by each plan fragment, it also
includes annotations showing whether a plan fragment
produces or consumes a runtime filter.
A plan fragment that produces a filter includes an
annotation such as
<codeph>runtime filters: <varname>filter_id</varname> &lt;- <varname>table</varname>.<varname>column</varname></codeph>,
while a plan fragment that consumes a filter includes an annotation such as
<codeph>runtime filters: <varname>filter_id</varname> -&gt; <varname>table</varname>.<varname>column</varname></codeph>.
<ph rev="2.11.0 IMPALA-4252">Setting the query option <codeph>EXPLAIN_LEVEL=2</codeph> adds additional
annotations showing the type of the filter, either <codeph><varname>filter_id</varname>[bloom]</codeph>
(for HDFS-based tables) or <codeph><varname>filter_id</varname>[min_max]</codeph> (for Kudu tables).</ph>
</p>
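    <p>
      For instance, the following illustrative excerpt shows how the filter type annotations
      might appear for a hash join (the exact plan output varies by query and release):
    </p>
<codeblock>SET EXPLAIN_LEVEL=2;
EXPLAIN SELECT c1 FROM huge_t1 JOIN tiny_t2 ON huge_t1.id = tiny_t2.id;
...
|  runtime filters: RF000[bloom] &lt;- tiny_t2.id
...
|     runtime filters: RF000[bloom] -&gt; huge_t1.id
...
</codeblock>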
<p>
The following example shows a query that uses a single runtime filter,
labeled <codeph>RF000</codeph>, to prune the partitions based on
evaluating the result set of a subquery at runtime:
</p>
<codeblock conref="../shared/impala_common.xml#common/simple_dpp_example"/>
<p>
The query profile (displayed by the <codeph>PROFILE</codeph> command
in <cmdname>impala-shell</cmdname>) contains both the
<codeph>EXPLAIN</codeph> plan and more detailed information about the
internal workings of the query. The profile output includes the
<codeph>Filter routing table</codeph> section with information about
each filter based on its ID.
</p>
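    <p>
      For example, a sketch of displaying the profile immediately after running a query in
      <cmdname>impala-shell</cmdname> (output omitted here):
    </p>
<codeblock>SELECT c1 FROM huge_t1 JOIN tiny_t2 ON huge_t1.id = tiny_t2.id;
PROFILE;  -- look for the "Filter routing table" section in the output
</codeblock>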
</conbody>
</concept>
<concept id="runtime_filtering_queries">
<title>Examples of Queries that Benefit from Runtime Filtering</title>
<conbody>
<p>
In this example, Impala would normally do extra work to interpret the columns
<codeph>C1</codeph>, <codeph>C2</codeph>, <codeph>C3</codeph>, and <codeph>ID</codeph>
for each row in <codeph>HUGE_T1</codeph>, before checking the <codeph>ID</codeph>
value against the in-memory hash table constructed from all the <codeph>TINY_T2.ID</codeph>
values. By producing a filter containing all the <codeph>TINY_T2.ID</codeph> values
even before the query starts scanning the <codeph>HUGE_T1</codeph> table, Impala
can skip the unnecessary work to parse the column info as soon as it determines
that an <codeph>ID</codeph> value does not match any of the values from the other table.
</p>
<p>
        The example shows <codeph>COMPUTE STATS</codeph> statements for both tables (even
though that is a one-time operation after loading data into those tables) because
Impala relies on up-to-date statistics to
determine which one has more distinct <codeph>ID</codeph> values than the other.
That information lets Impala make effective decisions about which table to use to
construct the in-memory hash table, and which table to read from disk and
compare against the entries in the hash table.
</p>
<codeblock rev="2.6.0">
COMPUTE STATS huge_t1;
COMPUTE STATS tiny_t2;
SELECT c1, c2, c3 FROM huge_t1 JOIN tiny_t2 WHERE huge_t1.id = tiny_t2.id;
</codeblock>
<!-- The greater-than comparison prevents runtime filtering from applying. Comment out for now;
put back if the example can be reworked in a way that does produce some filters.
<p>
In this example, <codeph>T1</codeph> is a table partitioned by year. The subquery
on <codeph>T2</codeph> produces a single value with the <codeph>MIN(year)</codeph> result,
and transmits that value as a filter to the plan fragments that are reading from <codeph>T1</codeph>.
Any non-matching partitions in <codeph>T1</codeph> are skipped.
</p>
<codeblock>
select c1 from t1 where year > (select min(year) from t2);
</codeblock>
-->
<p>
In this example, <codeph>T1</codeph> is a table partitioned by year. The subquery
on <codeph>T2</codeph> produces multiple values, and transmits those values as a filter to the plan
fragments that are reading from <codeph>T1</codeph>. Any non-matching partitions in <codeph>T1</codeph>
are skipped.
</p>
<codeblock rev="2.6.0">
select c1 from t1 where year in (select distinct year from t2);
</codeblock>
<p>
Now the <codeph>WHERE</codeph> clause contains an additional test that does not apply to
the partition key column.
A filter on a column that is not a partition key is called a per-row filter.
Because per-row filters only apply for Parquet, <codeph>T1</codeph> must be a Parquet table.
</p>
<p>
The subqueries result in two filters being transmitted to
the scan nodes that read from <codeph>T1</codeph>. The filter on <codeph>YEAR</codeph> helps the query eliminate
entire partitions based on non-matching years. The filter on <codeph>C2</codeph> lets Impala discard
rows with non-matching <codeph>C2</codeph> values immediately after reading them. Without runtime filtering,
Impala would have to keep the non-matching values in memory, assemble <codeph>C1</codeph>, <codeph>C2</codeph>,
and <codeph>C3</codeph> into rows in the intermediate result set, and transmit all the intermediate rows
back to the coordinator node, where they would be eliminated only at the very end of the query.
</p>
<codeblock rev="2.6.0">
select c1, c2, c3 from t1
where year in (select distinct year from t2)
and c2 in (select other_column from t3);
</codeblock>
<p>
This example involves a broadcast join.
The fact that the <codeph>ON</codeph> clause would
return a small number of matching rows (because there
are not very many rows in <codeph>TINY_T2</codeph>)
means that the corresponding filter is very selective.
Therefore, runtime filtering will probably be effective
in optimizing this query.
</p>
<codeblock rev="2.6.0">
select c1 from huge_t1 join [broadcast] tiny_t2
on huge_t1.id = tiny_t2.id
where huge_t1.year in (select distinct year from tiny_t2)
and c2 in (select other_column from t3);
</codeblock>
<p>
This example involves a shuffle or partitioned join.
Assume that most rows in <codeph>HUGE_T1</codeph>
have a corresponding row in <codeph>HUGE_T2</codeph>.
The fact that the <codeph>ON</codeph> clause could
return a large number of matching rows means that
the corresponding filter would not be very selective.
Therefore, runtime filtering might be less effective
in optimizing this query.
</p>
<codeblock rev="2.6.0">
select c1 from huge_t1 join [shuffle] huge_t2
on huge_t1.id = huge_t2.id
where huge_t1.year in (select distinct year from huge_t2)
and c2 in (select other_column from t3);
</codeblock>
</conbody>
</concept>
<concept id="runtime_filtering_tuning">
<title>Tuning and Troubleshooting Queries that Use Runtime Filtering</title>
<conbody>
<p>
These tuning and troubleshooting procedures apply to queries that are
resource-intensive enough, long-running enough, and frequent enough
that you can devote special attention to optimizing them individually.
</p>
<p>
Use the <codeph>EXPLAIN</codeph> statement and examine the <codeph>runtime filters:</codeph>
lines to determine whether runtime filters are being applied to the <codeph>WHERE</codeph> predicates
        and join clauses that you expect. For example, runtime filtering does not apply to queries that use
        the nested-loop join mechanism because of non-equijoin operators, as in the sketch below.
</p>
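    <p>
      In the following sketch, the non-equijoin operator forces a nested-loop join,
      so the resulting plan contains no <codeph>runtime filters:</codeph> lines (table names
      are illustrative):
    </p>
<codeblock>EXPLAIN SELECT t1.c1 FROM huge_t1 t1 JOIN tiny_t2 t2 ON t1.id &lt; t2.id;
</codeblock>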
<p>
Make sure statistics are up-to-date for all tables involved in the queries.
Use the <codeph>COMPUTE STATS</codeph> statement after loading data into non-partitioned tables,
and <codeph>COMPUTE INCREMENTAL STATS</codeph> after adding new partitions to partitioned tables.
</p>
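    <p>
      For example (table names are illustrative):
    </p>
<codeblock>COMPUTE STATS unpartitioned_table;
COMPUTE INCREMENTAL STATS partitioned_table;  -- picks up newly added partitions
</codeblock>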
<p>
If join queries involving large tables use unique columns as the join keys,
for example joining a primary key column with a foreign key column, the overhead of
producing and transmitting the filter might outweigh the performance benefit because
not much data could be pruned during the early stages of the query.
For such queries, consider setting the query option <codeph>RUNTIME_FILTER_MODE=OFF</codeph>.
</p>
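    <p>
      For example, a sketch of disabling runtime filtering for one such query (table and column
      names are illustrative):
    </p>
<codeblock>SET RUNTIME_FILTER_MODE=OFF;
SELECT h1.c1 FROM huge_t1 h1 JOIN huge_t2 h2 ON h1.pk_col = h2.fk_col;
SET RUNTIME_FILTER_MODE=GLOBAL;  -- restore the default afterward
</codeblock>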
</conbody>
</concept>
<concept id="runtime_filtering_limits">
<title>Limitations and Restrictions for Runtime Filtering</title>
<conbody>
<p>
        The runtime filtering feature is most effective for the Parquet file format.
For other file formats, filtering only applies for partitioned tables.
See <xref href="impala_runtime_filtering.xml#runtime_filtering_file_formats"/>.
For the ways in which runtime filtering works for Kudu tables, see
<xref href="impala_kudu.xml#kudu_performance"/>.
</p>
<!-- To do: check if this restriction is lifted in 5.8 / 2.6. -->
<p rev="IMPALA-3054">
When the spill-to-disk mechanism is activated on a particular host during a query,
that host does not produce any filters while processing that query.
This limitation does not affect the correctness of results; it only reduces the
amount of optimization that can be applied to the query.
</p>
</conbody>
</concept>
</concept>