<?xml version="1.0" encoding="UTF-8"?>
<!--
Licensed to the Apache Software Foundation (ASF) under one
or more contributor license agreements. See the NOTICE file
distributed with this work for additional information
regarding copyright ownership. The ASF licenses this file
to you under the Apache License, Version 2.0 (the
"License"); you may not use this file except in compliance
with the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing,
software distributed under the License is distributed on an
"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
KIND, either express or implied. See the License for the
specific language governing permissions and limitations
under the License.
-->
<!DOCTYPE concept PUBLIC "-//OASIS//DTD DITA Concept//EN" "concept.dtd">
<concept id="runtime_filtering" rev="2.5.0">

<title id="runtime_filters">Runtime Filtering for Impala Queries (<keyword keyref="impala25"/> or higher only)</title>
<titlealts audience="PDF"><navtitle>Runtime Filtering</navtitle></titlealts>
<prolog>
<metadata>
<data name="Category" value="Impala"/>
<data name="Category" value="SQL"/>
<data name="Category" value="Querying"/>
<data name="Category" value="Performance"/>
<data name="Category" value="Developers"/>
<data name="Category" value="Data Analysts"/>
</metadata>
</prolog>

<conbody>

<p rev="2.5.0">
<indexterm audience="hidden">runtime filtering</indexterm>
<term>Runtime filtering</term> is a wide-ranging optimization feature available in
<keyword keyref="impala25_full"/> and higher. When only a fraction of the data in a table is
needed for a query against a partitioned table, or to evaluate a join condition,
Impala determines the appropriate conditions while the query is running and
broadcasts that information to all the <cmdname>impalad</cmdname> nodes that are reading the table.
The nodes can then avoid unnecessary I/O by skipping partition data that is not needed,
and avoid unnecessary network transmission by sending across the network only the
subset of rows that match the join keys.
</p>

<p>
This feature is primarily used to optimize queries against large partitioned tables
(under the name <term>dynamic partition pruning</term>) and joins of large tables.
This section covers concepts, internals, and troubleshooting for the entire
runtime filtering feature.
For specific tuning steps for partitioned tables,
<!-- and join queries, -->
see
<xref href="impala_partitioning.xml#dynamic_partition_pruning"/>.
<!-- and <xref href="impala_joins.xml#joins"/>. -->
</p>

<note type="important" rev="2.6.0">
<p rev="2.6.0">
When this feature made its debut in <keyword keyref="impala25"/>,
the default setting was <codeph>RUNTIME_FILTER_MODE=LOCAL</codeph>.
Now the default is <codeph>RUNTIME_FILTER_MODE=GLOBAL</codeph> in <keyword keyref="impala26_full"/> and higher,
which enables more wide-ranging and ambitious query optimization without requiring you to
explicitly set any query options.
</p>
</note>

<p outputclass="toc inpage"/>

</conbody>

<concept id="runtime_filtering_concepts">
<title>Background Information for Runtime Filtering</title>
<conbody>
<p>
To understand how runtime filtering works at a detailed level, you must
be familiar with some terminology from the field of distributed database technology:
</p>
<ul>
<li>
<p> What a <term>plan fragment</term> is. Impala decomposes each query
into smaller units of work that are distributed across the cluster.
Wherever possible, a data block is read, filtered, and aggregated by
plan fragments executing on the same host. For some operations, such
as joins and combining intermediate results into a final result set,
data is transmitted across the network from one Impala daemon to
another. </p>
</li>
<li>
<p>
What <codeph>SCAN</codeph> and <codeph>HASH JOIN</codeph> plan nodes are, and their role in computing query results:
</p>
<p>
In the Impala query plan, a <term>scan node</term> performs the I/O to read from the underlying data files.
Although this is an expensive operation from the traditional database perspective, Hadoop clusters and Impala are
optimized to do this kind of I/O in a highly parallel fashion. The major potential cost savings come from using
the columnar Parquet format (where Impala can avoid reading data for unneeded columns) and partitioned tables
(where Impala can avoid reading data for unneeded partitions).
</p>
<p>
Most Impala joins use the
<xref href="https://en.wikipedia.org/wiki/Hash_join" scope="external" format="html"><term>hash join</term></xref>
mechanism. (Impala uses the nested-loop join technique only for certain kinds of
non-equijoin queries.)
In a hash join, when evaluating join conditions from two tables, Impala constructs a hash table in memory with all
the different column values from the table on one side of the join.
Then, for each row from the table on the other side of the join, Impala tests whether the relevant column values
are in this hash table or not.
</p>
<p>
A <term>hash join node</term> constructs such an in-memory hash table, then performs the comparisons to
identify which rows match the relevant join conditions
and should be included in the result set (or at least sent on to the subsequent intermediate stage of
query processing). Because some of the input for a hash join might be transmitted across the network from another host,
it is especially important from a performance perspective to prune out ahead of time any data that is known to be
irrelevant.
</p>
<p>
The more distinct values there are in the columns used as join keys, the larger the in-memory
hash table, and thus the more memory is required to process the query.
</p>
</li>
<li>
<p>
The difference between a <term>broadcast join</term> and a <term>shuffle join</term>.
(The Hadoop notion of a shuffle join is sometimes referred to in Impala as a <term>partitioned join</term>.)
In a broadcast join, the table from one side of the join (typically the smaller table)
is sent in its entirety to all the hosts involved in the query. Then each host can compare its
portion of the data from the other (larger) table against the full set of possible join keys.
In a shuffle join, there is no obvious <q>smaller</q> table, and so the contents of both tables
are divided up, and corresponding portions of the data are transmitted to each host involved in the query.
See <xref href="impala_hints.xml#hints"/> for information about how these different kinds of
joins are processed.
</p>
</li>
<li>
<p>
The notion of the build phase and probe phase when Impala processes a join query.
The <term>build phase</term> is where the rows containing the join key columns, typically for the smaller table,
are transmitted across the network and built into an in-memory hash table data structure on one or
more destination nodes.
The <term>probe phase</term> is where data is read locally (typically from the larger table) and the join key columns
are compared to the values in the in-memory hash table.
The corresponding input sources (tables, subqueries, and so on) for these
phases are referred to as the <term>build side</term> and the <term>probe side</term>.
</p>
</li>
<li>
<p>
How to set Impala query options: interactively within an <cmdname>impala-shell</cmdname> session through
the <codeph>SET</codeph> command, for a JDBC or ODBC application through the <codeph>SET</codeph> statement, or
globally for all <cmdname>impalad</cmdname> daemons through the <codeph>default_query_options</codeph> configuration
setting. (See the example following this list.)
</p>
</li>
</ul>
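<p>
For example, the following minimal sketch shows the interactive technique from the
last item. The <codeph>SET</codeph> syntax is standard Impala; the option value shown
is only illustrative:
</p>
<codeblock>
-- Within an impala-shell session, adjust an option for the current session.
SET RUNTIME_FILTER_MODE=GLOBAL;

-- Display the current values of all query options.
SET;
</codeblock>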
</conbody>
</concept>

<concept id="runtime_filtering_internals">
<title>Runtime Filtering Internals</title>
<conbody>
<p>
The <term>filter</term> that is transmitted between plan fragments is essentially a list
of values for join key columns. When this list of values is transmitted in time to a scan node,
Impala can filter out non-matching values immediately after reading them, rather than transmitting
the raw data to another host to compare against the in-memory hash table on that host.
</p>
<p>
For HDFS-based tables, this data structure is implemented as a <term>Bloom filter</term>, a
probabilistic data structure for testing set membership. (Because of the probabilistic aspect,
the filter might pass through some non-matching values, but it never rejects a matching one,
so the false positives do not cause any inaccuracy in the final results.)
</p>
<p rev="2.11.0 IMPALA-4252">
Another kind of filter is the <q>min-max</q> filter. Currently, it applies only to Kudu tables. The
filter is a data structure representing a minimum and maximum value. These filters are passed to
Kudu to reduce the number of rows returned to Impala when scanning the probe side of the join.
</p>
<p>
There are different kinds of filters to match the different kinds of joins (partitioned and broadcast).
A broadcast filter reflects the complete list of relevant values and can be immediately evaluated by a scan node.
A partitioned filter reflects only the values processed by one host in the
cluster; all the partitioned filters must be combined into one (by the coordinator node) before the
scan nodes can use the results to accurately filter the data as it is read from storage.
</p>
<p>
Broadcast filters are also classified as local or global. With a local broadcast filter, the information
in the filter is used by a subsequent query fragment that is running on the same host that produced the filter.
A non-local broadcast filter must be transmitted across the network to a query fragment that is running on a
different host. Impala designates three hosts to each produce non-local broadcast filters, to guard against the
possibility of a single slow host taking too long. Depending on the setting of the <codeph>RUNTIME_FILTER_MODE</codeph> query option
(<codeph>LOCAL</codeph> or <codeph>GLOBAL</codeph>), Impala either uses a conservative optimization
strategy where filters are only consumed on the same host that produced them, or a more aggressive strategy
where filters are eligible to be transmitted across the network.
</p>

<note rev="2.6.0 IMPALA-3333">
In <keyword keyref="impala26_full"/> and higher, the default for runtime filtering is the <codeph>GLOBAL</codeph> setting.
</note>

</conbody>
</concept>

<concept id="runtime_filtering_file_formats">
<title>File Format Considerations for Runtime Filtering</title>
<conbody>
<p>
Parquet tables get the most benefit from
the runtime filtering optimizations. Runtime filtering can speed up
join queries against partitioned or unpartitioned Parquet tables,
and single-table queries against partitioned Parquet tables.
See <xref href="impala_parquet.xml#parquet"/> for information about
using Parquet tables with Impala.
</p>
<p>
For other file formats (text, Avro, RCFile, and SequenceFile),
runtime filtering speeds up queries against partitioned tables only.
Because partitioned tables can use a mixture of formats, Impala produces
the filters in all cases, even if they are not ultimately used to
optimize the query.
</p>
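<p>
For example, a table declared like the following hypothetical one (shown only for
illustration) is positioned to get the most benefit: queries with predicates on
<codeph>YEAR</codeph> can use partition pruning, and join queries can use per-row
filtering within the Parquet data files:
</p>
<codeblock>
-- Hypothetical partitioned Parquet table.
CREATE TABLE sales_parquet (id BIGINT, c1 STRING, c2 STRING)
  PARTITIONED BY (year INT)
  STORED AS PARQUET;
</codeblock>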
</conbody>
</concept>

<concept id="runtime_filtering_timing">
<title>Wait Intervals for Runtime Filters</title>
<conbody>
<p>
Because it takes time to produce runtime filters, especially for
partitioned filters that must be combined by the coordinator node,
there is a time interval above which it is more efficient for
the scan nodes to go ahead and construct their intermediate result sets,
even if that intermediate data is larger than optimal. If it only takes
a few seconds to produce the filters, it is worth the extra time if pruning
the unnecessary data can save minutes in the overall query time.
You can specify the maximum wait time in milliseconds using the
<codeph>RUNTIME_FILTER_WAIT_TIME_MS</codeph> query option.
</p>
<p>
The time is counted from the start of query execution; see the documentation for the
<codeph>RUNTIME_FILTER_WAIT_TIME_MS</codeph> query option for details.
If all filters have not arrived within the specified interval, the scan node
proceeds, using whatever filters did arrive to help avoid reading unnecessary
data. If a filter arrives after the scan node begins reading data, the scan node
applies that filter to the data that is read after the filter arrives, but not to
the data that was already read.
</p>
<p>
If the cluster is relatively busy and your workload contains many
resource-intensive or long-running queries, consider increasing the wait time
so that complicated queries do not miss opportunities for optimization.
If the cluster is lightly loaded and your workload contains many small queries
taking only a few seconds, consider decreasing the wait time to avoid the
default 1-second delay for each query.
</p>
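<p>
For example, the following statement raises the maximum wait time for the current
session. The value shown is purely illustrative; choose one based on your own workload:
</p>
<codeblock>
-- Let scan nodes wait up to 10 seconds (10000 ms) for filters to arrive.
SET RUNTIME_FILTER_WAIT_TIME_MS=10000;
</codeblock>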
</conbody>
</concept>


<concept id="runtime_filtering_query_options">
<title>Query Options for Runtime Filtering</title>
<conbody>
<p>
See the following sections for information about the query options that control runtime filtering:
</p>
<ul>
<li>
<p>
The first query option adjusts the <q>sensitivity</q> of this feature.
<ph rev="2.6.0 IMPALA-3333">By default, it is set to the highest level (<codeph>GLOBAL</codeph>).
(This default applies to <keyword keyref="impala26_full"/> and higher.
In previous releases, the default was <codeph>LOCAL</codeph>.)</ph>
</p>
<ul>
<li>
<p>
<xref href="impala_runtime_filter_mode.xml#runtime_filter_mode"/>
</p>
</li>
</ul>
</li>
<li>
<p>
The other query options are tuning knobs that you typically adjust only after doing
performance testing, and that you might want to change only for the duration of a single
expensive query (see the example following this list):
</p>
<ul>
<li>
<p>
<xref href="impala_max_num_runtime_filters.xml#max_num_runtime_filters"/>
</p>
</li>
<li>
<p>
<xref href="impala_disable_row_runtime_filtering.xml#disable_row_runtime_filtering"/>
</p>
</li>
<li>
<p rev="2.6.0 IMPALA-3480">
<xref href="impala_runtime_filter_max_size.xml#runtime_filter_max_size"/>
</p>
</li>
<li>
<p rev="2.6.0 IMPALA-3480">
<xref href="impala_runtime_filter_min_size.xml#runtime_filter_min_size"/>
</p>
</li>
<li>
<p rev="2.6.0 IMPALA-3007">
<xref href="impala_runtime_bloom_filter_size.xml#runtime_bloom_filter_size"/>;
in <keyword keyref="impala26_full"/> and higher, this setting acts as a fallback when
statistics are not available, rather than as a directive.
</p>
</li>
</ul>
</li>
</ul>
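<p>
For example, you might adjust some of these knobs for the duration of one session before
running an expensive query. This is a minimal sketch: the option names are real, but the
values are purely illustrative, not recommendations.
</p>
<codeblock>
-- Hypothetical session-level tuning before one expensive query.
SET RUNTIME_FILTER_MODE=GLOBAL;
SET MAX_NUM_RUNTIME_FILTERS=16;      -- illustrative value
SET RUNTIME_FILTER_MAX_SIZE=8388608; -- cap each filter at 8 MB (illustrative)

-- ... run the expensive query here ...
</codeblock>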
</conbody>
</concept>

<concept id="runtime_filtering_explain_plan">
<title>Runtime Filtering and Query Plans</title>
<conbody>
<p>
In the same way the query plan displayed by the
<codeph>EXPLAIN</codeph> statement includes information
about predicates used by each plan fragment, it also
includes annotations showing whether a plan fragment
produces or consumes a runtime filter.
A plan fragment that produces a filter includes an
annotation such as
<codeph>runtime filters: <varname>filter_id</varname> &lt;- <varname>table</varname>.<varname>column</varname></codeph>,
while a plan fragment that consumes a filter includes an annotation such as
<codeph>runtime filters: <varname>filter_id</varname> -> <varname>table</varname>.<varname>column</varname></codeph>.
<ph rev="2.11.0 IMPALA-4252">Setting the query option <codeph>EXPLAIN_LEVEL=2</codeph> adds additional
annotations showing the type of the filter, either <codeph><varname>filter_id</varname>[bloom]</codeph>
(for HDFS-based tables) or <codeph><varname>filter_id</varname>[min_max]</codeph> (for Kudu tables).</ph>
</p>

<p>
The following example shows a query that uses a single runtime filter,
labeled <codeph>RF000</codeph>, to prune the partitions based on
evaluating the result set of a subquery at runtime:
</p>

<codeblock conref="../shared/impala_common.xml#common/simple_dpp_example"/>

<p>
The query profile (displayed by the <codeph>PROFILE</codeph> command
in <cmdname>impala-shell</cmdname>) contains both the
<codeph>EXPLAIN</codeph> plan and more detailed information about the
internal workings of the query. The profile output includes the
<codeph>Filter routing table</codeph> section with information about
each filter based on its ID.
</p>
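<p>
For example, the following <cmdname>impala-shell</cmdname> sketch shows how to view the
filter types in the plan and then examine the profile of a query you just ran. The
statements are standard Impala syntax; the tables are the hypothetical ones used
elsewhere in this topic:
</p>
<codeblock>
-- Show the filter types (bloom or min_max) in the plan.
SET EXPLAIN_LEVEL=2;
EXPLAIN SELECT c1 FROM huge_t1 JOIN tiny_t2 ON huge_t1.id = tiny_t2.id;

-- Run the query, then display its profile, including the
-- 'Filter routing table' section.
SELECT c1 FROM huge_t1 JOIN tiny_t2 ON huge_t1.id = tiny_t2.id;
PROFILE;
</codeblock>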
</conbody>
</concept>

<concept id="runtime_filtering_queries">
<title>Examples of Queries that Benefit from Runtime Filtering</title>
<conbody>

<p>
In this example, Impala would normally do extra work to interpret the columns
<codeph>C1</codeph>, <codeph>C2</codeph>, <codeph>C3</codeph>, and <codeph>ID</codeph>
for each row in <codeph>HUGE_T1</codeph>, before checking the <codeph>ID</codeph>
value against the in-memory hash table constructed from all the <codeph>TINY_T2.ID</codeph>
values. By producing a filter containing all the <codeph>TINY_T2.ID</codeph> values
even before the query starts scanning the <codeph>HUGE_T1</codeph> table, Impala
can skip the unnecessary work to parse the column info as soon as it determines
that an <codeph>ID</codeph> value does not match any of the values from the other table.
</p>

<p>
The example shows <codeph>COMPUTE STATS</codeph> statements for both the tables (even
though that is a one-time operation after loading data into those tables) because
Impala relies on up-to-date statistics to
determine which one has more distinct <codeph>ID</codeph> values than the other.
That information lets Impala make effective decisions about which table to use to
construct the in-memory hash table, and which table to read from disk and
compare against the entries in the hash table.
</p>

<codeblock rev="2.6.0">
COMPUTE STATS huge_t1;
COMPUTE STATS tiny_t2;
SELECT c1, c2, c3 FROM huge_t1 JOIN tiny_t2 WHERE huge_t1.id = tiny_t2.id;
</codeblock>

<!-- The greater-than comparison prevents runtime filtering from applying. Comment out for now;
put back if the example can be reworked in a way that does produce some filters.
<p>
In this example, <codeph>T1</codeph> is a table partitioned by year. The subquery
on <codeph>T2</codeph> produces a single value with the <codeph>MIN(year)</codeph> result,
and transmits that value as a filter to the plan fragments that are reading from <codeph>T1</codeph>.
Any non-matching partitions in <codeph>T1</codeph> are skipped.
</p>

<codeblock>
select c1 from t1 where year > (select min(year) from t2);
</codeblock>
-->

<p>
In this example, <codeph>T1</codeph> is a table partitioned by year. The subquery
on <codeph>T2</codeph> produces multiple values, and transmits those values as a filter to the plan
fragments that are reading from <codeph>T1</codeph>. Any non-matching partitions in <codeph>T1</codeph>
are skipped.
</p>

<codeblock rev="2.6.0">
select c1 from t1 where year in (select distinct year from t2);
</codeblock>

<p>
Now the <codeph>WHERE</codeph> clause contains an additional test that does not apply to
the partition key column.
A filter on a column that is not a partition key is called a per-row filter.
Because per-row filters apply only to Parquet tables, <codeph>T1</codeph> must be a Parquet table.
</p>

<p>
The subqueries result in two filters being transmitted to
the scan nodes that read from <codeph>T1</codeph>. The filter on <codeph>YEAR</codeph> helps the query eliminate
entire partitions based on non-matching years. The filter on <codeph>C2</codeph> lets Impala discard
rows with non-matching <codeph>C2</codeph> values immediately after reading them. Without runtime filtering,
Impala would have to keep the non-matching values in memory, assemble <codeph>C1</codeph>, <codeph>C2</codeph>,
and <codeph>C3</codeph> into rows in the intermediate result set, and transmit all the intermediate rows
back to the coordinator node, where they would be eliminated only at the very end of the query.
</p>

<codeblock rev="2.6.0">
select c1, c2, c3 from t1
where year in (select distinct year from t2)
and c2 in (select other_column from t3);
</codeblock>

<p>
This example involves a broadcast join.
The fact that the <codeph>ON</codeph> clause would
return a small number of matching rows (because there
are not very many rows in <codeph>TINY_T2</codeph>)
means that the corresponding filter is very selective.
Therefore, runtime filtering will probably be effective
in optimizing this query.
</p>

<codeblock rev="2.6.0">
select c1 from huge_t1 join [broadcast] tiny_t2
on huge_t1.id = tiny_t2.id
where huge_t1.year in (select distinct year from tiny_t2)
and c2 in (select other_column from t3);
</codeblock>

<p>
This example involves a shuffle or partitioned join.
Assume that most rows in <codeph>HUGE_T1</codeph>
have a corresponding row in <codeph>HUGE_T2</codeph>.
The fact that the <codeph>ON</codeph> clause could
return a large number of matching rows means that
the corresponding filter would not be very selective.
Therefore, runtime filtering might be less effective
in optimizing this query.
</p>

<codeblock rev="2.6.0">
select c1 from huge_t1 join [shuffle] huge_t2
on huge_t1.id = huge_t2.id
where huge_t1.year in (select distinct year from huge_t2)
and c2 in (select other_column from t3);
</codeblock>

</conbody>
</concept>

<concept id="runtime_filtering_tuning">
<title>Tuning and Troubleshooting Queries that Use Runtime Filtering</title>
<conbody>
<p>
These tuning and troubleshooting procedures apply to queries that are
resource-intensive enough, long-running enough, and frequent enough
that you can devote special attention to optimizing them individually.
</p>

<p>
Use the <codeph>EXPLAIN</codeph> statement and examine the <codeph>runtime filters:</codeph>
lines to determine whether runtime filters are being applied to the <codeph>WHERE</codeph> predicates
and join clauses that you expect. For example, runtime filtering does not apply to queries that use
the nested-loop join mechanism, because they contain non-equijoin operators.
</p>
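<p>
For example, an abbreviated plan excerpt might look like the following sketch.
The exact <codeph>EXPLAIN</codeph> output varies by Impala version and table
definitions, so treat this as illustrative only; the important parts are the
paired producer (<codeph>&lt;-</codeph>) and consumer (<codeph>-></codeph>)
annotations that share the same filter ID:
</p>
<codeblock>
EXPLAIN SELECT c1 FROM huge_t1 JOIN tiny_t2 ON huge_t1.id = tiny_t2.id;
...
02:HASH JOIN [INNER JOIN, BROADCAST]
|  runtime filters: RF000 &lt;- tiny_t2.id
...
00:SCAN HDFS [huge_t1]
   runtime filters: RF000 -> huge_t1.id
...
</codeblock>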

<p>
Make sure statistics are up-to-date for all tables involved in the queries.
Use the <codeph>COMPUTE STATS</codeph> statement after loading data into non-partitioned tables,
and <codeph>COMPUTE INCREMENTAL STATS</codeph> after adding new partitions to partitioned tables.
</p>
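<p>
For example (<codeph>huge_t1</codeph> is a table used elsewhere in this topic;
<codeph>part_tbl</codeph> is a hypothetical partitioned table):
</p>
<codeblock>
-- After loading data into a non-partitioned table.
COMPUTE STATS huge_t1;

-- After adding new partitions to a partitioned table.
COMPUTE INCREMENTAL STATS part_tbl;
</codeblock>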

<p>
If join queries involving large tables use unique columns as the join keys,
for example joining a primary key column with a foreign key column, the overhead of
producing and transmitting the filter might outweigh the performance benefit because
not much data could be pruned during the early stages of the query.
For such queries, consider setting the query option <codeph>RUNTIME_FILTER_MODE=OFF</codeph>.
</p>
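<p>
For example:
</p>
<codeblock>
-- The join keys are nearly unique on both sides, so runtime filters
-- would prune little data; turn off the feature for this session.
SET RUNTIME_FILTER_MODE=OFF;
</codeblock>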

</conbody>
</concept>

<concept id="runtime_filtering_limits">
<title>Limitations and Restrictions for Runtime Filtering</title>
<conbody>
<p>
The runtime filtering feature is most effective for the Parquet file format.
For other file formats, filtering applies only to partitioned tables.
See <xref href="impala_runtime_filtering.xml#runtime_filtering_file_formats"/>.
For the ways in which runtime filtering works for Kudu tables, see
<xref href="impala_kudu.xml#kudu_performance"/>.
</p>

<!-- To do: check if this restriction is lifted in 5.8 / 2.6. -->
<p rev="IMPALA-3054">
When the spill-to-disk mechanism is activated on a particular host during a query,
that host does not produce any filters while processing that query.
This limitation does not affect the correctness of results; it only reduces the
amount of optimization that can be applied to the query.
</p>

</conbody>
</concept>


</concept>