| <?xml version="1.0" encoding="UTF-8"?> |
| <!-- |
| Licensed to the Apache Software Foundation (ASF) under one |
| or more contributor license agreements. See the NOTICE file |
| distributed with this work for additional information |
| regarding copyright ownership. The ASF licenses this file |
| to you under the Apache License, Version 2.0 (the |
| "License"); you may not use this file except in compliance |
| with the License. You may obtain a copy of the License at |
| |
| http://www.apache.org/licenses/LICENSE-2.0 |
| |
| Unless required by applicable law or agreed to in writing, |
| software distributed under the License is distributed on an |
| "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY |
| KIND, either express or implied. See the License for the |
| specific language governing permissions and limitations |
| under the License. |
| --> |
| <!DOCTYPE concept PUBLIC "-//OASIS//DTD DITA Concept//EN" "concept.dtd"> |
| <concept rev="2.0.0" id="subqueries"> |
| |
| <title>Subqueries in Impala SELECT Statements</title> |
| <titlealts audience="PDF"><navtitle>Subqueries</navtitle></titlealts> |
| <prolog> |
| <metadata> |
| <data name="Category" value="Impala"/> |
| <data name="Category" value="SQL"/> |
| <data name="Category" value="Querying"/> |
| <data name="Category" value="Developers"/> |
| <data name="Category" value="Data Analysts"/> |
| </metadata> |
| </prolog> |
| |
| <conbody> |
| |
| <p rev="2.0.0"> |
| <indexterm audience="hidden">subqueries</indexterm> |
| A <term>subquery</term> is a query that is nested within another query. Subqueries let queries on one table |
| dynamically adapt based on the contents of another table. This technique provides great flexibility and |
| expressive power for SQL queries. |
| </p> |
| |
| <p> |
| A subquery can return a result set for use in the <codeph>FROM</codeph> or <codeph>WITH</codeph> clauses, or |
| with operators such as <codeph>IN</codeph> or <codeph>EXISTS</codeph>. |
| </p> |
| |
| <p> |
| A <term>scalar subquery</term> produces a result set with a single row containing a single column, typically |
| produced by an aggregation function such as <codeph>MAX()</codeph> or <codeph>SUM()</codeph>. This single |
| result value can be substituted in scalar contexts such as arguments to comparison operators. If the result |
| set is empty, the value of the scalar subquery is <codeph>NULL</codeph>. For example, the following query |
| finds the maximum value of <codeph>T2.Y</codeph> and then substitutes that value into the |
| <codeph>WHERE</codeph> clause of the outer block that queries <codeph>T1</codeph>: |
| </p> |
| |
| <codeblock>SELECT x FROM t1 WHERE x > (SELECT MAX(y) FROM t2); |
| </codeblock> |
| |
| <p> |
| <term>Uncorrelated subqueries</term> do not refer to any tables from the outer block of the query. The same |
| value or set of values produced by the subquery is used when evaluating each row from the outer query block. |
| In this example, the subquery returns an arbitrary number of values from <codeph>T2.Y</codeph>, and each |
| value of <codeph>T1.X</codeph> is tested for membership in that same set of values: |
| </p> |
| |
| <codeblock>SELECT x FROM t1 WHERE x IN (SELECT y FROM t2); |
| </codeblock> |
| |
| <p> |
| <term>Correlated subqueries</term> compare one or more values from the outer query block to values referenced |
| in the <codeph>WHERE</codeph> clause of the subquery. Each row evaluated by the outer <codeph>WHERE</codeph> |
| clause can be evaluated using a different set of values. These kinds of subqueries are restricted in the |
| kinds of comparisons they can do between columns of the inner and outer tables. (See the following |
| <b>Restrictions</b> item.) |
| </p> |
| |
| <p> |
| For example, the following query finds all the employees with salaries that are higher than average for their |
| department. The subquery potentially computes a different <codeph>AVG()</codeph> value for each employee. |
| </p> |
| |
| <!-- TK: Construct an EMPLOYEES schema to try out examples like these. --> |
| |
| <codeblock>SELECT employee_name, employee_id FROM employees one WHERE |
| salary > (SELECT avg(salary) FROM employees two WHERE one.dept_id = two.dept_id); |
| </codeblock> |
| |
| <p conref="../shared/impala_common.xml#common/syntax_blurb"/> |
| |
| <p> |
| <b>Subquery in the <codeph>FROM</codeph> clause:</b> |
| </p> |
| |
| <codeblock>SELECT <varname>select_list</varname> FROM <varname>table_ref</varname> [, <varname>table_ref</varname> ...] |
| |
| <varname>table_ref</varname> ::= <varname>table_name</varname> | (<varname>select_statement</varname>) |
| </codeblock> |
| |
| <p> |
| <b>Subqueries in <codeph>WHERE</codeph> clause:</b> |
| </p> |
| |
| <codeblock>WHERE <varname>value</varname> <varname>comparison_operator</varname> (<varname>scalar_select_statement</varname>) |
| WHERE <varname>value</varname> [NOT] IN (<varname>select_statement</varname>) |
| WHERE [NOT] EXISTS (<varname>correlated_select_statement</varname>) |
| WHERE NOT EXISTS (<varname>correlated_select_statement</varname>) |
| </codeblock> |
| |
| <p> |
| <codeph>comparison_operator</codeph> is a numeric comparison such as <codeph>=</codeph>, |
| <codeph><=</codeph>, <codeph>!=</codeph>, and so on, or a string comparison operator such as |
| <codeph>LIKE</codeph> or <codeph>REGEXP</codeph>. |
| </p> |
| |
| <p rev="2.0.0"> |
| Although you can use non-equality comparison operators such as <codeph><</codeph> or |
| <codeph>>=</codeph>, the subquery must include at least one equality comparison between the columns of the |
| inner and outer query blocks. |
| </p> |
| |
| <p> |
| All syntax is available for both correlated and uncorrelated queries, except that the <codeph>NOT |
| EXISTS</codeph> clause cannot be used with an uncorrelated subquery. |
| </p> |
| |
| <p> |
| Impala subqueries can be nested arbitrarily deep. |
| </p> |
| |
| <p conref="../shared/impala_common.xml#common/sql1999"/> |
| |
| <p conref="../shared/impala_common.xml#common/example_blurb"/> |
| |
| <p> |
| This example illustrates how subqueries can be used in the <codeph>FROM</codeph> clause to organize the table |
| names, column names, and column values by producing intermediate result sets, especially for join queries. |
| </p> |
| |
| <codeblock>SELECT avg(t1.x), max(t2.y) FROM |
| (SELECT id, cast(a AS DECIMAL(10,5)) AS x FROM raw_data WHERE a BETWEEN 0 AND 100) AS t1 |
| JOIN |
| (SELECT id, length(s) AS y FROM raw_data WHERE s LIKE 'A%') AS t2; |
| USING (id); |
| </codeblock> |
| |
| <p rev="2.0.0"> |
| These examples show how a query can test for the existence of values in a separate table using the |
| <codeph>EXISTS()</codeph> operator with a subquery. |
| <!-- |
| Internally, these queries are processed in a way similar to join queries. |
| Because the values from the second table are not part of the result set, the subquery |
| is more efficient than the equivalent join query. |
| --> |
| </p> |
| |
| <p> |
| The following examples show how a value can be compared against a set of values returned by a subquery. |
| </p> |
| |
| <codeblock rev="2.0.0">SELECT count(x) FROM t1 WHERE EXISTS(SELECT 1 FROM t2 WHERE t1.x = t2.y * 10); |
| |
| SELECT x FROM t1 WHERE x IN (SELECT y FROM t2 WHERE state = 'CA'); |
| </codeblock> |
| |
| <p> |
| The following examples demonstrate scalar subqueries. When a subquery is known to return a single value, you |
| can substitute it where you would normally put a constant value. |
| </p> |
| |
| <codeblock>SELECT x FROM t1 WHERE y = (SELECT max(z) FROM t2); |
| SELECT x FROM t1 WHERE y > (SELECT count(z) FROM t2); |
| </codeblock> |
| |
| <!-- <p conref="../shared/impala_common.xml#common/partitioning_blurb"/> --> |
| |
| <!-- |
| <p conref="../shared/impala_common.xml#common/hbase_blurb"/> |
| <p> |
| Currently, the <codeph>IN (<varname>subquery</varname>)</codeph> operator results in a full table scan |
| of an HBase table, rather than being translated into a series of single-row lookups. |
| Therefore, this is not an efficient construct to use with Impala queries for HBase tables. |
| </p> |
| --> |
| |
| <!-- |
| <p conref="../shared/impala_common.xml#common/parquet_blurb"/> |
| <p conref="../shared/impala_common.xml#common/text_blurb"/> |
| <p conref="../shared/impala_common.xml#common/compatibility_blurb"/> |
| --> |
| |
| <p conref="../shared/impala_common.xml#common/usage_notes_blurb"/> |
| |
| <p> |
| If the same table is referenced in both the outer and inner query blocks, construct a table alias in the |
| outer query block and use a fully qualified name to distinguish the inner and outer table references: |
| </p> |
| |
| <!-- TK: verify the logic of this example. Probably have other similar ones that could be reused here. --> |
| |
| <codeblock>SELECT * FROM t1 one WHERE id IN (SELECT parent FROM t1 two WHERE t1.parent = t2.id); |
| </codeblock> |
| |
| <p conref="../shared/impala_common.xml#common/straight_join_nested_queries"/> |
| |
| <p conref="../shared/impala_common.xml#common/internals_blurb"/> |
| |
| <p> |
| Internally, subqueries involving <codeph>IN</codeph>, <codeph>NOT IN</codeph>, <codeph>EXISTS</codeph>, or |
| <codeph>NOT EXISTS</codeph> clauses are rewritten into join queries. Depending on the syntax, the subquery |
| might be rewritten to an outer join, semi join, cross join, or anti join. |
| </p> |
| |
| <p> |
| A query is processed differently depending on whether the subquery calls any aggregation functions. There are |
| correlated and uncorrelated forms, with and without calls to aggregation functions. Each of these four |
| categories is rewritten differently. |
| </p> |
| |
| <p conref="../shared/impala_common.xml#common/column_stats_blurb"/> |
| |
| <p> |
| Because queries that include correlated and uncorrelated subqueries in the <codeph>WHERE</codeph> clause are |
| written into join queries, to achieve best performance, follow the same guidelines for running the |
| <codeph>COMPUTE STATS</codeph> statement as you do for tables involved in regular join queries. Run the |
| <codeph>COMPUTE STATS</codeph> statement for each associated tables after loading or substantially changing |
| the data in that table. See <xref href="impala_perf_stats.xml#perf_stats"/> for details. |
| </p> |
| |
| <p> |
| <b>Added in:</b> Subqueries are substantially enhanced starting in Impala 2.0. Now, |
| they can be used in the <codeph>WHERE</codeph> clause, in combination with clauses such as |
| <codeph>EXISTS</codeph> and <codeph>IN</codeph>, rather than just in the <codeph>FROM</codeph> clause. |
| </p> |
| |
| <p conref="../shared/impala_common.xml#common/restrictions_blurb"/> |
| |
| <p> |
| The initial Impala support for nested subqueries addresses the most common use cases. Some restrictions |
| remain: |
| </p> |
| |
| <ul> |
| <li> |
| <p> |
| Although you can use subqueries in a query involving <codeph>UNION</codeph> or <codeph>UNION ALL</codeph> |
| in Impala 2.1.0 and higher, currently you cannot construct a union of two subqueries (for example, in the |
| argument of an <codeph>IN</codeph> or <codeph>EXISTS</codeph> operator). |
| </p> |
| </li> |
| |
| <li> |
| <p> |
| Subqueries returning scalar values cannot be used with the operators <codeph>ANY</codeph> or |
| <codeph>ALL</codeph>. (Impala does not currently have a <codeph>SOME</codeph> operator, but if it did, |
| the same restriction would apply.) |
| </p> |
| </li> |
| |
| <li> |
| <p> |
| For the <codeph>EXISTS</codeph> and <codeph>NOT EXISTS</codeph> clauses, any subquery comparing values |
| from the outer query block to another table must use at least one equality comparison, not exclusively |
| other kinds of comparisons such as less than, greater than, <codeph>BETWEEN</codeph>, or |
| <codeph>!=</codeph>. |
| </p> |
| </li> |
| |
| <li> |
| <!-- TK: think this is no longer true. --> |
| <p> |
| Currently, a scalar subquery cannot be used as the first or second argument to the |
| <codeph>BETWEEN</codeph> operator. |
| </p> |
| </li> |
| |
| <li> |
| <p> |
| A subquery cannot be used inside an <codeph>OR</codeph> conjunction. Expressions inside a subquery, for |
| example in the <codeph>WHERE</codeph> clause, can use <codeph>OR</codeph> conjunctions; the restriction |
| only applies to parts of the query <q>above</q> the subquery. |
| </p> |
| </li> |
| |
| <li> |
| <p> |
| Scalar subqueries are only supported in numeric contexts. You cannot use a scalar subquery as an argument |
| to the <codeph>LIKE</codeph>, <codeph>REGEXP</codeph>, or <codeph>RLIKE</codeph> operators, or compare it |
| to a value of a non-numeric type such as <codeph>TIMESTAMP</codeph> or <codeph>BOOLEAN</codeph>. |
| </p> |
| </li> |
| |
| <li> |
| <p> |
| <!-- A subquery cannot be used to generate a scalar value for a function call. --> |
| You cannot use subqueries with the <codeph>CASE</codeph> function to generate the comparison value, the |
| values to be compared against, or the return value. |
| </p> |
| </li> |
| |
| <li> |
| <p> |
| A subquery is not allowed in the filter condition for the <codeph>HAVING</codeph> clause. (Strictly |
| speaking, a subquery cannot appear anywhere outside the <codeph>WITH</codeph>, <codeph>FROM</codeph>, and |
| <codeph>WHERE</codeph> clauses.) |
| </p> |
| </li> |
| |
| <li> |
| <p> |
| You must use a fully qualified name |
| (<codeph><varname>table_name</varname>.<varname>column_name</varname></codeph> or |
| <codeph><varname>database_name</varname>.<varname>table_name</varname>.<varname>column_name</varname></codeph>) |
| when referring to any column from the outer query block within a subquery. |
| </p> |
| </li> |
| <li> |
| <p conref="../shared/impala_common.xml#common/tablesample_caveat"/> |
| </li> |
| </ul> |
| |
| <p conref="../shared/impala_common.xml#common/complex_types_blurb"/> |
| |
| <p rev="2.3.0"> |
| For the complex types (<codeph>ARRAY</codeph>, <codeph>STRUCT</codeph>, and <codeph>MAP</codeph>) |
| available in <keyword keyref="impala23_full"/> and higher, the join queries that <q>unpack</q> complex type |
| columns often use correlated subqueries in the <codeph>FROM</codeph> clause. |
| For example, if the first table in the join clause is <codeph>CUSTOMER</codeph>, the second |
| join clause might have a subquery that selects from the column <codeph>CUSTOMER.C_ORDERS</codeph>, |
| which is an <codeph>ARRAY</codeph>. The subquery re-evaluates the <codeph>ARRAY</codeph> elements |
| corresponding to each row from the <codeph>CUSTOMER</codeph> table. |
| See <xref href="impala_complex_types.xml#complex_types"/> for details and examples of |
| using subqueries with complex types. |
| </p> |
| |
| <p conref="../shared/impala_common.xml#common/related_info"/> |
| |
| <p> |
| <xref href="impala_operators.xml#exists"/>, <xref href="impala_operators.xml#in"/> |
| </p> |
| </conbody> |
| </concept> |