<?xml version="1.0" encoding="UTF-8"?>
<!--
Licensed to the Apache Software Foundation (ASF) under one
or more contributor license agreements. See the NOTICE file
distributed with this work for additional information
regarding copyright ownership. The ASF licenses this file
to you under the Apache License, Version 2.0 (the
"License"); you may not use this file except in compliance
with the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing,
software distributed under the License is distributed on an
"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
KIND, either express or implied. See the License for the
specific language governing permissions and limitations
under the License.
-->
<!DOCTYPE concept PUBLIC "-//OASIS//DTD DITA Concept//EN" "concept.dtd">
<concept id="runtime_filtering" rev="2.5.0">
<title id="runtime_filters">Runtime Filtering for Impala Queries (<keyword keyref="impala25"/> or higher only)</title>
<titlealts audience="PDF"><navtitle>Runtime Filtering</navtitle></titlealts>
<prolog>
<metadata>
<data name="Category" value="Impala"/>
<data name="Category" value="SQL"/>
<data name="Category" value="Querying"/>
<data name="Category" value="Performance"/>
<data name="Category" value="Developers"/>
<data name="Category" value="Data Analysts"/>
</metadata>
</prolog>
<conbody>
<p rev="2.5.0">
<indexterm audience="hidden">runtime filtering</indexterm>
      <term>Runtime filtering</term> is a wide-ranging optimization feature available in
      <keyword keyref="impala25_full"/> and higher. When only a fraction of the data in a table is
      needed for a query against a partitioned table or to evaluate a join condition,
      Impala determines the appropriate conditions while the query is running, and
      broadcasts that information to all the <cmdname>impalad</cmdname> nodes that are reading the table.
      Those nodes can then avoid unnecessary I/O by skipping irrelevant partition data, and avoid
      unnecessary network transmission by sending only the subset of rows that match the join keys
      across the network.
</p>
<p>
This feature is primarily used to optimize queries against large partitioned tables
(under the name <term>dynamic partition pruning</term>) and joins of large tables.
The information in this section includes concepts, internals, and troubleshooting
information for the entire runtime filtering feature.
For specific tuning steps for partitioned tables,
<!-- and join queries, -->
see
<xref href="impala_partitioning.xml#dynamic_partition_pruning"/>.
<!-- and <xref href="impala_joins.xml#joins"/>. -->
</p>
<note type="important" rev="2.6.0">
<p rev="2.6.0">
When this feature made its debut in <keyword keyref="impala25"/>,
the default setting was <codeph>RUNTIME_FILTER_MODE=LOCAL</codeph>.
      In <keyword keyref="impala26_full"/> and higher, the default is <codeph>RUNTIME_FILTER_MODE=GLOBAL</codeph>,
      which enables more wide-ranging and ambitious query optimization without requiring you to
      explicitly set any query options.
</p>
</note>
<p outputclass="toc inpage"/>
</conbody>
<concept id="runtime_filtering_concepts">
<title>Background Information for Runtime Filtering</title>
<conbody>
<p>
To understand how runtime filtering works at a detailed level, you must
be familiar with some terminology from the field of distributed database technology:
</p>
<ul>
<li>
<p> What a <term>plan fragment</term> is. Impala decomposes each query
into smaller units of work that are distributed across the cluster.
Wherever possible, a data block is read, filtered, and aggregated by
plan fragments executing on the same host. For some operations, such
as joins and combining intermediate results into a final result set,
data is transmitted across the network from one Impala daemon to
another. </p>
</li>
<li>
<p>
What <codeph>SCAN</codeph> and <codeph>HASH JOIN</codeph> plan nodes are, and their role in computing query results:
</p>
<p>
In the Impala query plan, a <term>scan node</term> performs the I/O to read from the underlying data files.
Although this is an expensive operation from the traditional database perspective, Hadoop clusters and Impala are
optimized to do this kind of I/O in a highly parallel fashion. The major potential cost savings come from using
the columnar Parquet format (where Impala can avoid reading data for unneeded columns) and partitioned tables
(where Impala can avoid reading data for unneeded partitions).
</p>
<p>
Most Impala joins use the
<xref href="https://en.wikipedia.org/wiki/Hash_join" scope="external" format="html"><term>hash join</term></xref>
          mechanism. (Impala started using the nested-loop join technique relatively recently,
          for certain kinds of non-equijoin queries.)
In a hash join, when evaluating join conditions from two tables, Impala constructs a hash table in memory with all
the different column values from the table on one side of the join.
Then, for each row from the table on the other side of the join, Impala tests whether the relevant column values
are in this hash table or not.
</p>
<p>
A <term>hash join node</term> constructs such an in-memory hash table, then performs the comparisons to
identify which rows match the relevant join conditions
and should be included in the result set (or at least sent on to the subsequent intermediate stage of
query processing). Because some of the input for a hash join might be transmitted across the network from another host,
it is especially important from a performance perspective to prune out ahead of time any data that is known to be
irrelevant.
</p>
<p>
          The more distinct values there are in the columns used as join keys, the larger the in-memory hash table,
          and thus the more memory is required to process the query.
</p>
</li>
<li>
<p>
The difference between a <term>broadcast join</term> and a <term>shuffle join</term>.
(The Hadoop notion of a shuffle join is sometimes referred to in Impala as a <term>partitioned join</term>.)
In a broadcast join, the table from one side of the join (typically the smaller table)
is sent in its entirety to all the hosts involved in the query. Then each host can compare its
portion of the data from the other (larger) table against the full set of possible join keys.
In a shuffle join, there is no obvious <q>smaller</q> table, and so the contents of both tables
are divided up, and corresponding portions of the data are transmitted to each host involved in the query.
See <xref href="impala_hints.xml#hints"/> for information about how these different kinds of
joins are processed.
</p>
</li>
<li>
<p>
The notion of the build phase and probe phase when Impala processes a join query.
The <term>build phase</term> is where the rows containing the join key columns, typically for the smaller table,
are transmitted across the network and built into an in-memory hash table data structure on one or
more destination nodes.
The <term>probe phase</term> is where data is read locally (typically from the larger table) and the join key columns
are compared to the values in the in-memory hash table.
The corresponding input sources (tables, subqueries, and so on) for these
phases are referred to as the <term>build side</term> and the <term>probe side</term>.
</p>
</li>
<li>
<p>
            How to set Impala query options: interactively within an <cmdname>impala-shell</cmdname> session through
            the <codeph>SET</codeph> command, for a JDBC or ODBC application through the <codeph>SET</codeph> statement, or
            globally for all <cmdname>impalad</cmdname> daemons through the <codeph>default_query_options</codeph> configuration
            setting. (A brief example follows this list.)
</p>
</li>
</ul>
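      <p>
        For example, here is a minimal sketch of setting runtime filtering options interactively
        within an <cmdname>impala-shell</cmdname> session. The option values are illustrative,
        not recommendations:
      </p>
<codeblock>-- Inside an impala-shell session. Values shown are illustrative.
SET RUNTIME_FILTER_MODE=GLOBAL;        -- the default in Impala 2.6 and higher
SET RUNTIME_FILTER_WAIT_TIME_MS=2000;  -- wait up to 2 seconds for filters to arrive
SELECT COUNT(*) FROM huge_t1 JOIN tiny_t2 ON huge_t1.id = tiny_t2.id;
</codeblock>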
</conbody>
</concept>
<concept id="runtime_filtering_internals">
<title>Runtime Filtering Internals</title>
<conbody>
<p>
The <term>filter</term> that is transmitted between plan fragments is essentially a list
of values for join key columns. When this list of values is transmitted in time to a scan node,
Impala can filter out non-matching values immediately after reading them, rather than transmitting
the raw data to another host to compare against the in-memory hash table on that host.
</p>
<p>
        For HDFS-based tables, this data structure is implemented as a <term>Bloom filter</term>, which uses
        a probability-based algorithm to compactly represent the set of possible matching values. (The probability-based
        aspect means that the filter might let through some non-matching values, but it never discards a matching one;
        any such false positives are eliminated later by the exact join comparison, so they do not cause any inaccuracy
        in the final results.)
</p>
<p rev="2.11.0 IMPALA-4252">
        Another kind of filter is the <q>min-max</q> filter. It currently applies only to Kudu tables. The
filter is a data structure representing a minimum and maximum value. These filters are passed to
Kudu to reduce the number of rows returned to Impala when scanning the probe side of the join.
</p>
<p>
There are different kinds of filters to match the different kinds of joins (partitioned and broadcast).
A broadcast filter reflects the complete list of relevant values and can be immediately evaluated by a scan node.
A partitioned filter reflects only the values processed by one host in the
cluster; all the partitioned filters must be combined into one (by the coordinator node) before the
scan nodes can use the results to accurately filter the data as it is read from storage.
</p>
<p>
Broadcast filters are also classified as local or global. With a local broadcast filter, the information
in the filter is used by a subsequent query fragment that is running on the same host that produced the filter.
A non-local broadcast filter must be transmitted across the network to a query fragment that is running on a
        different host. Impala designates three hosts to each produce non-local broadcast filters, to guard against the
possibility of a single slow host taking too long. Depending on the setting of the <codeph>RUNTIME_FILTER_MODE</codeph> query option
(<codeph>LOCAL</codeph> or <codeph>GLOBAL</codeph>), Impala either uses a conservative optimization
strategy where filters are only consumed on the same host that produced them, or a more aggressive strategy
where filters are eligible to be transmitted across the network.
</p>
<note rev="2.6.0 IMPALA-3333">
In <keyword keyref="impala26_full"/> and higher, the default for runtime filtering is the <codeph>GLOBAL</codeph> setting.
</note>
</conbody>
</concept>
<concept id="runtime_filtering_file_formats">
<title>File Format Considerations for Runtime Filtering</title>
<conbody>
<p>
Parquet tables get the most benefit from
the runtime filtering optimizations. Runtime filtering can speed up
join queries against partitioned or unpartitioned Parquet tables,
and single-table queries against partitioned Parquet tables.
See <xref href="impala_parquet.xml#parquet"/> for information about
using Parquet tables with Impala.
</p>
<p>
For other file formats (text, Avro, RCFile, and SequenceFile),
runtime filtering speeds up queries against partitioned tables only.
Because partitioned tables can use a mixture of formats, Impala produces
the filters in all cases, even if they are not ultimately used to
optimize the query.
</p>
</conbody>
</concept>
<concept id="runtime_filtering_timing">
<title>Wait Intervals for Runtime Filters</title>
<conbody>
<p>
        Because it takes time to produce runtime filters, especially for
        partitioned filters that must be combined by the coordinator node,
        there is a point beyond which it is more efficient for
        the scan nodes to go ahead and construct their intermediate result sets,
        even though that intermediate data is larger than optimal. If producing
        the filters takes only a few seconds, the extra wait is worthwhile when pruning
        the unnecessary data can save minutes in the overall query time.
        You can specify the maximum wait time in milliseconds using the
        <codeph>RUNTIME_FILTER_WAIT_TIME_MS</codeph> query option.
</p>
<p>
By default, each scan node waits for up to 1 second (1000 milliseconds)
for filters to arrive. If all filters have not arrived within the
specified interval, the scan node proceeds, using whatever filters
did arrive to help avoid reading unnecessary data. If a filter arrives
after the scan node begins reading data, the scan node applies that
filter to the data that is read after the filter arrives, but not to
the data that was already read.
</p>
<p>
If the cluster is relatively busy and your workload contains many
resource-intensive or long-running queries, consider increasing the wait time
so that complicated queries do not miss opportunities for optimization.
If the cluster is lightly loaded and your workload contains many small queries
taking only a few seconds, consider decreasing the wait time to avoid the
        1-second delay for each query.
</p>
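    <p>
      For example, the following sketch raises the wait interval for one complicated join query
      on a busy cluster (the value shown is illustrative):
    </p>
<codeblock>SET RUNTIME_FILTER_WAIT_TIME_MS=5000;  -- wait up to 5 seconds rather than the default 1 second
SELECT c1, c2, c3 FROM huge_t1 JOIN huge_t2 ON huge_t1.id = huge_t2.id;
</codeblock>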
</conbody>
</concept>
<concept id="runtime_filtering_query_options">
<title>Query Options for Runtime Filtering</title>
<conbody>
<p>
See the following sections for information about the query options that control runtime filtering:
</p>
<ul>
<li>
<p>
The first query option adjusts the <q>sensitivity</q> of this feature.
<ph rev="2.6.0 IMPALA-3333">By default, it is set to the highest level (<codeph>GLOBAL</codeph>).
(This default applies to <keyword keyref="impala26_full"/> and higher.
In previous releases, the default was <codeph>LOCAL</codeph>.)</ph>
</p>
<ul>
<li>
<p>
<xref href="impala_runtime_filter_mode.xml#runtime_filter_mode"/>
</p>
</li>
</ul>
</li>
<li>
<p>
            The other query options are tuning knobs that you typically adjust only after doing
            performance testing, and that you might want to change only for the duration of a single
            expensive query (see the sketch following this list):
</p>
<ul>
<li>
<p>
<xref href="impala_max_num_runtime_filters.xml#max_num_runtime_filters"/>
</p>
</li>
<li>
<p>
<xref href="impala_disable_row_runtime_filtering.xml#disable_row_runtime_filtering"/>
</p>
</li>
<li>
<p rev="2.6.0 IMPALA-3480">
<xref href="impala_runtime_filter_max_size.xml#runtime_filter_max_size"/>
</p>
</li>
<li>
<p rev="2.6.0 IMPALA-3480">
<xref href="impala_runtime_filter_min_size.xml#runtime_filter_min_size"/>
</p>
</li>
<li>
<p rev="2.6.0 IMPALA-3007">
<xref href="impala_runtime_bloom_filter_size.xml#runtime_bloom_filter_size"/>;
in <keyword keyref="impala26_full"/> and higher, this setting acts as a fallback when
statistics are not available, rather than as a directive.
</p>
</li>
</ul>
</li>
</ul>
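    <p>
      For example, you might change one of these tuning knobs just for a single expensive query,
      as in the following sketch (the option value is illustrative, not a recommendation):
    </p>
<codeblock>SET RUNTIME_FILTER_MAX_SIZE=8388608;  -- illustrative 8 MB cap on each filter
SELECT c1, c2, c3 FROM huge_t1 JOIN huge_t2 ON huge_t1.id = huge_t2.id;
-- Reset the option, or end the session, before running unrelated queries.
</codeblock>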
</conbody>
</concept>
<concept id="runtime_filtering_explain_plan">
<title>Runtime Filtering and Query Plans</title>
<conbody>
<p>
        In the same way that the query plan displayed by the
<codeph>EXPLAIN</codeph> statement includes information
about predicates used by each plan fragment, it also
includes annotations showing whether a plan fragment
produces or consumes a runtime filter.
A plan fragment that produces a filter includes an
annotation such as
<codeph>runtime filters: <varname>filter_id</varname> &lt;- <varname>table</varname>.<varname>column</varname></codeph>,
while a plan fragment that consumes a filter includes an annotation such as
<codeph>runtime filters: <varname>filter_id</varname> -&gt; <varname>table</varname>.<varname>column</varname></codeph>.
<ph rev="2.11.0 IMPALA-4252">Setting the query option <codeph>EXPLAIN_LEVEL=2</codeph> adds additional
annotations showing the type of the filter, either <codeph><varname>filter_id</varname>[bloom]</codeph>
(for HDFS-based tables) or <codeph><varname>filter_id</varname>[min_max]</codeph> (for Kudu tables).</ph>
</p>
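    <p>
      For instance, the following illustrative excerpt shows how the filter type annotations
      might appear for a hash join (the exact plan output varies by query and release):
    </p>
<codeblock>SET EXPLAIN_LEVEL=2;
EXPLAIN SELECT c1 FROM huge_t1 JOIN tiny_t2 ON huge_t1.id = tiny_t2.id;
...
|  runtime filters: RF000[bloom] &lt;- tiny_t2.id
...
|     runtime filters: RF000[bloom] -&gt; huge_t1.id
...
</codeblock>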
<p>
The following example shows a query that uses a single runtime filter,
labeled <codeph>RF000</codeph>, to prune the partitions based on
evaluating the result set of a subquery at runtime:
</p>
<codeblock conref="../shared/impala_common.xml#common/simple_dpp_example"/>
<p>
The query profile (displayed by the <codeph>PROFILE</codeph> command
in <cmdname>impala-shell</cmdname>) contains both the
<codeph>EXPLAIN</codeph> plan and more detailed information about the
internal workings of the query. The profile output includes the
<codeph>Filter routing table</codeph> section with information about
each filter based on its ID.
</p>
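    <p>
      For example, a sketch of displaying the profile immediately after running a query in
      <cmdname>impala-shell</cmdname> (output omitted here):
    </p>
<codeblock>SELECT c1 FROM huge_t1 JOIN tiny_t2 ON huge_t1.id = tiny_t2.id;
PROFILE;  -- look for the "Filter routing table" section in the output
</codeblock>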
</conbody>
</concept>
<concept id="runtime_filtering_queries">
<title>Examples of Queries that Benefit from Runtime Filtering</title>
<conbody>
<p>
In this example, Impala would normally do extra work to interpret the columns
<codeph>C1</codeph>, <codeph>C2</codeph>, <codeph>C3</codeph>, and <codeph>ID</codeph>
for each row in <codeph>HUGE_T1</codeph>, before checking the <codeph>ID</codeph>
value against the in-memory hash table constructed from all the <codeph>TINY_T2.ID</codeph>
values. By producing a filter containing all the <codeph>TINY_T2.ID</codeph> values
even before the query starts scanning the <codeph>HUGE_T1</codeph> table, Impala
can skip the unnecessary work to parse the column info as soon as it determines
that an <codeph>ID</codeph> value does not match any of the values from the other table.
</p>
<p>
        The example shows <codeph>COMPUTE STATS</codeph> statements for both tables (even
though that is a one-time operation after loading data into those tables) because
Impala relies on up-to-date statistics to
determine which one has more distinct <codeph>ID</codeph> values than the other.
That information lets Impala make effective decisions about which table to use to
construct the in-memory hash table, and which table to read from disk and
compare against the entries in the hash table.
</p>
<codeblock rev="2.6.0">
COMPUTE STATS huge_t1;
COMPUTE STATS tiny_t2;
SELECT c1, c2, c3 FROM huge_t1 JOIN tiny_t2 WHERE huge_t1.id = tiny_t2.id;
</codeblock>
<!-- The greater-than comparison prevents runtime filtering from applying. Comment out for now;
put back if the example can be reworked in a way that does produce some filters.
<p>
In this example, <codeph>T1</codeph> is a table partitioned by year. The subquery
on <codeph>T2</codeph> produces a single value with the <codeph>MIN(year)</codeph> result,
and transmits that value as a filter to the plan fragments that are reading from <codeph>T1</codeph>.
Any non-matching partitions in <codeph>T1</codeph> are skipped.
</p>
<codeblock>
select c1 from t1 where year > (select min(year) from t2);
</codeblock>
-->
<p>
In this example, <codeph>T1</codeph> is a table partitioned by year. The subquery
on <codeph>T2</codeph> produces multiple values, and transmits those values as a filter to the plan
fragments that are reading from <codeph>T1</codeph>. Any non-matching partitions in <codeph>T1</codeph>
are skipped.
</p>
<codeblock rev="2.6.0">
select c1 from t1 where year in (select distinct year from t2);
</codeblock>
<p>
Now the <codeph>WHERE</codeph> clause contains an additional test that does not apply to
the partition key column.
A filter on a column that is not a partition key is called a per-row filter.
Because per-row filters only apply for Parquet, <codeph>T1</codeph> must be a Parquet table.
</p>
<p>
The subqueries result in two filters being transmitted to
the scan nodes that read from <codeph>T1</codeph>. The filter on <codeph>YEAR</codeph> helps the query eliminate
entire partitions based on non-matching years. The filter on <codeph>C2</codeph> lets Impala discard
rows with non-matching <codeph>C2</codeph> values immediately after reading them. Without runtime filtering,
Impala would have to keep the non-matching values in memory, assemble <codeph>C1</codeph>, <codeph>C2</codeph>,
and <codeph>C3</codeph> into rows in the intermediate result set, and transmit all the intermediate rows
back to the coordinator node, where they would be eliminated only at the very end of the query.
</p>
<codeblock rev="2.6.0">
select c1, c2, c3 from t1
where year in (select distinct year from t2)
and c2 in (select other_column from t3);
</codeblock>
<p>
This example involves a broadcast join.
The fact that the <codeph>ON</codeph> clause would
return a small number of matching rows (because there
are not very many rows in <codeph>TINY_T2</codeph>)
means that the corresponding filter is very selective.
Therefore, runtime filtering will probably be effective
in optimizing this query.
</p>
<codeblock rev="2.6.0">
select c1 from huge_t1 join [broadcast] tiny_t2
on huge_t1.id = tiny_t2.id
where huge_t1.year in (select distinct year from tiny_t2)
and c2 in (select other_column from t3);
</codeblock>
<p>
This example involves a shuffle or partitioned join.
Assume that most rows in <codeph>HUGE_T1</codeph>
have a corresponding row in <codeph>HUGE_T2</codeph>.
The fact that the <codeph>ON</codeph> clause could
return a large number of matching rows means that
the corresponding filter would not be very selective.
Therefore, runtime filtering might be less effective
in optimizing this query.
</p>
<codeblock rev="2.6.0">
select c1 from huge_t1 join [shuffle] huge_t2
on huge_t1.id = huge_t2.id
where huge_t1.year in (select distinct year from huge_t2)
and c2 in (select other_column from t3);
</codeblock>
</conbody>
</concept>
<concept id="runtime_filtering_tuning">
<title>Tuning and Troubleshooting Queries that Use Runtime Filtering</title>
<conbody>
<p>
These tuning and troubleshooting procedures apply to queries that are
resource-intensive enough, long-running enough, and frequent enough
that you can devote special attention to optimizing them individually.
</p>
<p>
Use the <codeph>EXPLAIN</codeph> statement and examine the <codeph>runtime filters:</codeph>
lines to determine whether runtime filters are being applied to the <codeph>WHERE</codeph> predicates
        and join clauses that you expect. For example, runtime filtering does not apply to queries that use
        the nested-loop join mechanism because of non-equijoin operators, as in the sketch below.
</p>
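    <p>
      In the following sketch, the non-equijoin operator forces a nested-loop join,
      so the resulting plan contains no <codeph>runtime filters:</codeph> lines (table names
      are illustrative):
    </p>
<codeblock>EXPLAIN SELECT t1.c1 FROM huge_t1 t1 JOIN tiny_t2 t2 ON t1.id &lt; t2.id;
</codeblock>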
<p>
Make sure statistics are up-to-date for all tables involved in the queries.
Use the <codeph>COMPUTE STATS</codeph> statement after loading data into non-partitioned tables,
and <codeph>COMPUTE INCREMENTAL STATS</codeph> after adding new partitions to partitioned tables.
</p>
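    <p>
      For example (table names are illustrative):
    </p>
<codeblock>COMPUTE STATS unpartitioned_table;
COMPUTE INCREMENTAL STATS partitioned_table;  -- picks up newly added partitions
</codeblock>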
<p>
If join queries involving large tables use unique columns as the join keys,
for example joining a primary key column with a foreign key column, the overhead of
producing and transmitting the filter might outweigh the performance benefit because
not much data could be pruned during the early stages of the query.
For such queries, consider setting the query option <codeph>RUNTIME_FILTER_MODE=OFF</codeph>.
</p>
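    <p>
      For example, a sketch of disabling runtime filtering for one such query (table and column
      names are illustrative):
    </p>
<codeblock>SET RUNTIME_FILTER_MODE=OFF;
SELECT h1.c1 FROM huge_t1 h1 JOIN huge_t2 h2 ON h1.pk_col = h2.fk_col;
SET RUNTIME_FILTER_MODE=GLOBAL;  -- restore the default afterward
</codeblock>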
</conbody>
</concept>
<concept id="runtime_filtering_limits">
<title>Limitations and Restrictions for Runtime Filtering</title>
<conbody>
<p>
        The runtime filtering feature is most effective for the Parquet file format.
For other file formats, filtering only applies for partitioned tables.
See <xref href="impala_runtime_filtering.xml#runtime_filtering_file_formats"/>.
For the ways in which runtime filtering works for Kudu tables, see
<xref href="impala_kudu.xml#kudu_performance"/>.
</p>
<!-- To do: check if this restriction is lifted in 5.8 / 2.6. -->
<p rev="IMPALA-3054">
When the spill-to-disk mechanism is activated on a particular host during a query,
that host does not produce any filters while processing that query.
This limitation does not affect the correctness of results; it only reduces the
amount of optimization that can be applied to the query.
</p>
</conbody>
</concept>
</concept>