| <?xml version="1.0" encoding="UTF-8"?> |
| <!-- |
| Licensed to the Apache Software Foundation (ASF) under one |
| or more contributor license agreements. See the NOTICE file |
| distributed with this work for additional information |
| regarding copyright ownership. The ASF licenses this file |
| to you under the Apache License, Version 2.0 (the |
| "License"); you may not use this file except in compliance |
| with the License. You may obtain a copy of the License at |
| |
| http://www.apache.org/licenses/LICENSE-2.0 |
| |
| Unless required by applicable law or agreed to in writing, |
| software distributed under the License is distributed on an |
| "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY |
| KIND, either express or implied. See the License for the |
| specific language governing permissions and limitations |
| under the License. |
| --> |
| <!DOCTYPE concept PUBLIC "-//OASIS//DTD DITA Concept//EN" "concept.dtd"> |
| <concept id="default_join_distribution_mode" rev="2.9.0 IMPALA-5381 IMPALA-5583"> |
| |
| <title>DEFAULT_JOIN_DISTRIBUTION_MODE Query Option</title> |
| <titlealts audience="PDF"><navtitle>DEFAULT_JOIN_DISTRIBUTION_MODE</navtitle></titlealts> |
| <prolog> |
| <metadata> |
| <data name="Category" value="Impala"/> |
| <data name="Category" value="Impala Query Options"/> |
| <data name="Category" value="Performance"/> |
| <data name="Category" value="Querying"/> |
| <data name="Category" value="Developers"/> |
| <data name="Category" value="Data Analysts"/> |
| </metadata> |
| </prolog> |
| |
| <conbody> |
| |
| <p> |
| <indexterm audience="hidden">DEFAULT_JOIN_DISTRIBUTION_MODE query option</indexterm> |
| This option determines the join distribution that Impala uses when any of the tables |
| involved in a join query is missing statistics. |
| </p> |
| |
| <p> |
| Impala optimizes join queries based on the presence of table statistics, |
| which are produced by the Impala <codeph>COMPUTE STATS</codeph> statement. |
| By default, when a table involved in the join query does not have statistics, |
| Impala uses the <q>broadcast</q> technique that transmits the entire contents |
| of the table to all executor nodes participating in the query. If one table |
| involved in a join has statistics and the other does not, the table without |
| statistics is broadcast. If both tables are missing statistics, the table |
| that is referenced second in the join order is broadcast. This behavior |
| is appropriate when the table involved is relatively small, but can lead to |
| excessive network, memory, and CPU overhead if the table being broadcast is |
| large. |
| </p> |
| |
| <p> |
| Because Impala queries frequently involve very large tables, and suboptimal |
| joins for such tables could result in spilling or out-of-memory errors, |
| the setting <codeph>DEFAULT_JOIN_DISTRIBUTION_MODE=SHUFFLE</codeph> lets you |
| override the default behavior. The shuffle join mechanism divides the corresponding rows |
| of each table involved in a join query using a hashing algorithm, and transmits |
| subsets of the rows to other nodes for processing. Typically, this kind of join is |
| more efficient for joins between large tables of similar size. |
| </p> |
| |
| <p> |
| The setting <codeph>DEFAULT_JOIN_DISTRIBUTION_MODE=SHUFFLE</codeph> is |
| recommended when setting up and deploying new clusters, because it is less likely |
| to result in serious consequences such as spilling or out-of-memory errors if |
| the query plan is based on incomplete information. This setting is not the default, |
| to avoid changing the performance characteristics of join queries for clusters that |
| are already tuned for their existing workloads. |
| </p> |
| |
| <p conref="../shared/impala_common.xml#common/type_integer"/> |
| <p> |
| The allowed values are <codeph>BROADCAST</codeph> (equivalent to 0) |
| or <codeph>SHUFFLE</codeph> (equivalent to 1). |
| </p> |
| |
| <p conref="../shared/impala_common.xml#common/example_blurb"/> |
| <p> |
| The following examples demonstrate appropriate scenarios for each |
| setting of this query option. |
| </p> |
| |
| <codeblock> |
| -- Create a billion-row table. |
| create table big_table stored as parquet |
| as select * from huge_table limit 1e9; |
| |
| -- For a big table with no statistics, the |
| -- shuffle join mechanism is appropriate. |
| set default_join_distribution_mode=shuffle; |
| |
| ...join queries involving the big table... |
| </codeblock> |
| |
| <codeblock> |
| -- Create a hundred-row table. |
| create table tiny_table stored as parquet |
| as select * from huge_table limit 100; |
| |
| -- For a tiny table with no statistics, the |
| -- broadcast join mechanism is appropriate. |
| set default_join_distribution_mode=broadcast; |
| |
| ...join queries involving the tiny table... |
| </codeblock> |
| |
| <codeblock> |
| compute stats tiny_table; |
| compute stats big_table; |
| |
| -- Once the stats are computed, the query option has |
| -- no effect on join queries involving these tables. |
| -- Impala can determine the absolute and relative sizes |
| -- of each side of the join query by examining the |
| -- row size, cardinality, and so on of each table. |
| |
| ...join queries involving both of these tables... |
| </codeblock> |
| |
| <p conref="../shared/impala_common.xml#common/related_info"/> |
| <p> |
| <xref keyref="compute_stats"/>, |
| <xref keyref="joins"/>, |
| <xref keyref="perf_joins"/> |
| </p> |
| |
| </conbody> |
| </concept> |