| <?xml version="1.0" encoding="UTF-8"?> |
| <!-- |
| Licensed to the Apache Software Foundation (ASF) under one |
| or more contributor license agreements. See the NOTICE file |
| distributed with this work for additional information |
| regarding copyright ownership. The ASF licenses this file |
| to you under the Apache License, Version 2.0 (the |
| "License"); you may not use this file except in compliance |
| with the License. You may obtain a copy of the License at |
| |
| http://www.apache.org/licenses/LICENSE-2.0 |
| |
| Unless required by applicable law or agreed to in writing, |
| software distributed under the License is distributed on an |
| "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY |
| KIND, either express or implied. See the License for the |
| specific language governing permissions and limitations |
| under the License. |
| --> |
| <!DOCTYPE concept PUBLIC "-//OASIS//DTD DITA Concept//EN" "concept.dtd"> |
| <concept id="hints"> |
| |
| <title>Query Hints in Impala SELECT Statements</title> |
| <titlealts audience="PDF"><navtitle>Hints</navtitle></titlealts> |
| <prolog> |
| <metadata> |
| <data name="Category" value="Impala"/> |
| <data name="Category" value="SQL"/> |
| <data name="Category" value="Querying"/> |
| <data name="Category" value="Performance"/> |
| <data name="Category" value="Troubleshooting"/> |
| <data name="Category" value="Developers"/> |
| <data name="Category" value="Data Analysts"/> |
| </metadata> |
| </prolog> |
| |
| <conbody> |
| |
| <p> |
| <indexterm audience="hidden">hints</indexterm> |
| The Impala SQL dialect supports query hints, for fine-tuning the inner workings of queries. Specify hints as |
| a temporary workaround for expensive queries, where missing statistics or other factors cause inefficient |
| performance. |
| </p> |
| |
| <p> |
| Hints are most often used for the most resource-intensive kinds of Impala queries: |
| </p> |
| |
| <ul> |
| <li> |
| Join queries involving large tables, where intermediate result sets are transmitted across the network to |
| evaluate the join conditions. |
| </li> |
| |
| <li> |
| Inserting into partitioned Parquet tables, where many memory buffers could be allocated on each host to |
| hold intermediate results for each partition. |
| </li> |
| </ul> |
| |
| <p conref="../shared/impala_common.xml#common/syntax_blurb"/> |
| |
| <p> |
| You can also represent the hints as keywords surrounded by <codeph>[]</codeph> |
| square brackets; include the brackets in the text of the SQL statement. |
| <note conref="../shared/impala_common.xml#common/square_bracket_hint_caveat"/> |
| </p> |
| |
| <codeblock>SELECT STRAIGHT_JOIN <varname>select_list</varname> FROM |
| <varname>join_left_hand_table</varname> |
| JOIN [{ /* +BROADCAST */ | /* +SHUFFLE */ }] |
| <varname>join_right_hand_table</varname> |
| <varname>remainder_of_query</varname>; |
| |
| INSERT <varname>insert_clauses</varname> |
| [{ /* +SHUFFLE */ | /* +NOSHUFFLE */ }] |
| [<ph rev="IMPALA-2522 2.8.0">/* +CLUSTERED */</ph>] |
| SELECT <varname>remainder_of_query</varname>; |
| </codeblock> |
| |
| <p rev="2.0.0"> |
| In <keyword keyref="impala20_full"/> and higher, you can also specify the hints inside comments that use |
| either the <codeph>/* */</codeph> or <codeph>--</codeph> notation. Specify a <codeph>+</codeph> symbol |
| immediately before the hint name. Recently added hints are only available using the <codeph>/* */</codeph> |
| and <codeph>--</codeph> notation. |
| For clarity, the <codeph>/* */</codeph> and <codeph>--</codeph> styles |
| are used in the syntax and examples throughout this section. |
| With the <codeph>/* */</codeph> or <codeph>--</codeph> notation for |
| hints, specify a <codeph>+</codeph> symbol immediately before the first hint name. |
| Multiple hints can be specified separated by commas, for example |
| <codeph>/* +clustered,shuffle */</codeph> |
| </p> |
| |
| <codeblock rev="2.0.0">SELECT STRAIGHT_JOIN <varname>select_list</varname> FROM |
| <varname>join_left_hand_table</varname> |
| JOIN /* +BROADCAST|SHUFFLE */ |
| <varname>join_right_hand_table</varname> |
| <varname>remainder_of_query</varname>; |
| |
| SELECT <varname>select_list</varname> FROM |
| <varname>join_left_hand_table</varname> |
| JOIN -- +BROADCAST|SHUFFLE |
| <varname>join_right_hand_table</varname> |
| <varname>remainder_of_query</varname>; |
| |
| INSERT <varname>insert_clauses</varname> |
| /* +SHUFFLE|NOSHUFFLE */ |
| SELECT <varname>remainder_of_query</varname>; |
| |
| INSERT <varname>insert_clauses</varname> |
| -- +SHUFFLE|NOSHUFFLE |
| SELECT <varname>remainder_of_query</varname>; |
| |
| <ph rev="IMPALA-2924">SELECT <varname>select_list</varname> FROM |
| <varname>table_ref</varname> |
| /* +{SCHEDULE_CACHE_LOCAL | SCHEDULE_DISK_LOCAL | SCHEDULE_REMOTE} |
| [,RANDOM_REPLICA] */ |
| <varname>remainder_of_query</varname>;</ph> |
| |
| <ph rev="IMPALA-2522 2.8.0">INSERT <varname>insert_clauses</varname> |
| -- +CLUSTERED |
| SELECT <varname>remainder_of_query</varname>; |
| |
| INSERT <varname>insert_clauses</varname> |
| /* +CLUSTERED */ |
| SELECT <varname>remainder_of_query</varname>;</ph> |
| </codeblock> |
| |
| <p conref="../shared/impala_common.xml#common/usage_notes_blurb"/> |
| |
| <p> |
| With both forms of hint syntax, include the <codeph>STRAIGHT_JOIN</codeph> |
| keyword immediately after the <codeph>SELECT</codeph> keyword to prevent Impala from |
| reordering the tables in a way that makes the join-related hints ineffective. |
| </p> |
| |
| <p> |
| To reduce the need to use hints, run the <codeph>COMPUTE STATS</codeph> statement against all tables involved |
| in joins, or used as the source tables for <codeph>INSERT ... SELECT</codeph> operations where the |
| destination is a partitioned Parquet table. Do this operation after loading data or making substantial |
| changes to the data within each table. Having up-to-date statistics helps Impala choose more efficient query |
| plans without the need for hinting. See <xref href="impala_perf_stats.xml#perf_stats"/> for details and |
| examples. |
| </p> |
| |
| <p> |
| To see which join strategy is used for a particular query, examine the <codeph>EXPLAIN</codeph> output for |
| that query. See <xref href="impala_explain_plan.xml#perf_explain"/> for details and examples. |
| </p> |
| |
| <p> |
| <b>Hints for join queries:</b> |
| </p> |
| |
| <p> |
| The <codeph>/* +BROADCAST */</codeph> and <codeph>/* +SHUFFLE */</codeph> hints control the execution strategy for join |
| queries. Specify one of the following constructs immediately after the <codeph>JOIN</codeph> keyword in a |
| query: |
| </p> |
| |
| <ul> |
| <li> |
| <codeph>/* +SHUFFLE */</codeph> - Makes that join operation use the <q>partitioned</q> technique, which divides |
| up corresponding rows from both tables using a hashing algorithm, sending subsets of the rows to other |
| nodes for processing. (The keyword <codeph>SHUFFLE</codeph> is used to indicate a <q>partitioned join</q>, |
| because that type of join is not related to <q>partitioned tables</q>.) Since the alternative |
| <q>broadcast</q> join mechanism is the default when table and index statistics are unavailable, you might |
| use this hint for queries where broadcast joins are unsuitable; typically, partitioned joins are more |
| efficient for joins between large tables of similar size. |
| </li> |
| |
| <li> |
| <codeph>/* +BROADCAST */</codeph> - Makes that join operation use the <q>broadcast</q> technique that sends the |
| entire contents of the right-hand table to all nodes involved in processing the join. This is the default |
| mode of operation when table and index statistics are unavailable, so you would typically only need it if |
| stale metadata caused Impala to mistakenly choose a partitioned join operation. Typically, broadcast joins |
| are more efficient in cases where one table is much smaller than the other. (Put the smaller table on the |
| right side of the <codeph>JOIN</codeph> operator.) |
| </li> |
| </ul> |
| |
| <p> |
| <b>Hints for INSERT ... SELECT queries:</b> |
| </p> |
| |
| <p conref="../shared/impala_common.xml#common/insert_hints"/> |
| |
| <p rev="IMPALA-2924"> |
| <b>Hints for scheduling of HDFS blocks:</b> |
| </p> |
| |
| <p rev="IMPALA-2924"> |
| The hints <codeph>/* +SCHEDULE_CACHE_LOCAL */</codeph>, |
| <codeph>/* +SCHEDULE_DISK_LOCAL */</codeph>, and |
| <codeph>/* +SCHEDULE_REMOTE */</codeph> have the same effect |
| as specifying the <codeph>REPLICA_PREFERENCE</codeph> query |
| option with the respective option settings of <codeph>CACHE_LOCAL</codeph>, |
| <codeph>DISK_LOCAL</codeph>, or <codeph>REMOTE</codeph>. |
| The hint <codeph>/* +RANDOM_REPLICA */</codeph> is the same as |
| enabling the <codeph>SCHEDULE_RANDOM_REPLICA</codeph> query option. |
| </p> |
| |
| <p rev="IMPALA-2924"> |
| You can use these hints in combination by separating them with commas, |
| for example, <codeph>/* +SCHEDULE_CACHE_LOCAL,RANDOM_REPLICA */</codeph>. |
| See <xref keyref="replica_preference"/> and |
| <xref keyref="schedule_random_replica"/> for information about how |
| these settings influence the way Impala processes HDFS data blocks. |
| </p> |
| |
| <p rev="IMPALA-2924"> |
| Specifying the replica preference as a query hint always overrides the |
| query option setting. Specifying either the <codeph>SCHEDULE_RANDOM_REPLICA</codeph> |
| query option or the corresponding <codeph>RANDOM_REPLICA</codeph> query hint |
| enables the random tie-breaking behavior when processing data blocks |
| during the query. |
| </p> |
| |
| <p> |
| <b>Suggestions versus directives:</b> |
| </p> |
| |
| <p> |
| In early Impala releases, hints were always obeyed and so acted more like directives. Once Impala gained join |
| order optimizations, sometimes join queries were automatically reordered in a way that made a hint |
| irrelevant. Therefore, the hints act more like suggestions in Impala 1.2.2 and higher. |
| </p> |
| |
| <p> |
| To force Impala to follow the hinted execution mechanism for a join query, include the |
| <codeph>STRAIGHT_JOIN</codeph> keyword in the <codeph>SELECT</codeph> statement. See |
| <xref href="impala_perf_joins.xml#straight_join"/> for details. When you use this technique, Impala does not |
| reorder the joined tables at all, so you must be careful to arrange the join order to put the largest table |
| (or subquery result set) first, then the smallest, second smallest, third smallest, and so on. This ordering lets Impala do the |
| most I/O-intensive parts of the query using local reads on the DataNodes, and then reduce the size of the |
| intermediate result set as much as possible as each subsequent table or subquery result set is joined. |
| </p> |
| |
| <p conref="../shared/impala_common.xml#common/restrictions_blurb"/> |
| |
| <p> |
| Queries that include subqueries in the <codeph>WHERE</codeph> clause can be rewritten internally as join |
| queries. Currently, you cannot apply hints to the joins produced by these types of queries. |
| </p> |
| |
| <p> |
| Because hints can prevent queries from taking advantage of new metadata or improvements in query planning, |
| use them only when required to work around performance issues, and be prepared to remove them when they are |
| no longer required, such as after a new Impala release or bug fix. |
| </p> |
| |
| <p> |
| In particular, the <codeph>/* +BROADCAST */</codeph> and <codeph>/* +SHUFFLE */</codeph> hints are expected to be |
| needed much less frequently in Impala 1.2.2 and higher, because the join order optimization feature in |
| combination with the <codeph>COMPUTE STATS</codeph> statement now automatically choose join order and join |
| mechanism without the need to rewrite the query and add hints. See |
| <xref href="impala_perf_joins.xml#perf_joins"/> for details. |
| </p> |
| |
| <p conref="../shared/impala_common.xml#common/compatibility_blurb"/> |
| |
| <p rev="2.0.0"> |
| The hints embedded within <codeph>--</codeph> comments are compatible with Hive queries. The hints embedded |
| within <codeph>/* */</codeph> comments or <codeph>[ ]</codeph> square brackets are not recognized by or not |
| compatible with Hive. For example, Hive raises an error for Impala hints within <codeph>/* */</codeph> |
| comments because it does not recognize the Impala hint names. |
| </p> |
| |
| <p conref="../shared/impala_common.xml#common/view_blurb"/> |
| |
| <p rev="2.0.0"> |
| If you use a hint in the query that defines a view, the hint is preserved when you query the view. Impala |
| internally rewrites all hints in views to use the <codeph>--</codeph> comment notation, so that Hive can |
| query such views without errors due to unrecognized hint names. |
| </p> |
| |
| <p conref="../shared/impala_common.xml#common/example_blurb"/> |
| |
| <p> |
| For example, this query joins a large customer table with a small lookup table of less than 100 rows. The |
| right-hand table can be broadcast efficiently to all nodes involved in the join. Thus, you would use the |
| <codeph>/* +broadcast */</codeph> hint to force a broadcast join strategy: |
| </p> |
| |
| <codeblock>select straight_join customer.address, state_lookup.state_name |
| from customer join <b>/* +broadcast */</b> state_lookup |
| on customer.state_id = state_lookup.state_id;</codeblock> |
| |
| <p> |
| This query joins two large tables of unpredictable size. You might benchmark the query with both kinds of |
| hints and find that it is more efficient to transmit portions of each table to other nodes for processing. |
| Thus, you would use the <codeph>/* +shuffle */</codeph> hint to force a partitioned join strategy: |
| </p> |
| |
| <codeblock>select straight_join weather.wind_velocity, geospatial.altitude |
| from weather join <b>/* +shuffle */</b> geospatial |
| on weather.lat = geospatial.lat and weather.long = geospatial.long;</codeblock> |
| |
| <p> |
| For joins involving three or more tables, the hint applies to the tables on either side of that specific |
| <codeph>JOIN</codeph> keyword. The <codeph>STRAIGHT_JOIN</codeph> keyword ensures that joins are processed |
| in a predictable order from left to right. For example, this query joins |
| <codeph>t1</codeph> and <codeph>t2</codeph> using a partitioned join, then joins that result set to |
| <codeph>t3</codeph> using a broadcast join: |
| </p> |
| |
| <codeblock>select straight_join t1.name, t2.id, t3.price |
| from t1 join <b>/* +shuffle */</b> t2 join <b>/* +broadcast */</b> t3 |
| on t1.id = t2.id and t2.id = t3.id;</codeblock> |
| |
| <!-- To do: This is a good place to add more sample output showing before and after EXPLAIN plans. --> |
| |
| <p conref="../shared/impala_common.xml#common/related_info"/> |
| |
| <p> |
| For more background information about join queries, see <xref href="impala_joins.xml#joins"/>. For |
| performance considerations, see <xref href="impala_perf_joins.xml#perf_joins"/>. |
| </p> |
| </conbody> |
| </concept> |