| <?xml version="1.0" encoding="UTF-8"?> |
| <!-- |
| Licensed to the Apache Software Foundation (ASF) under one |
| or more contributor license agreements. See the NOTICE file |
| distributed with this work for additional information |
| regarding copyright ownership. The ASF licenses this file |
| to you under the Apache License, Version 2.0 (the |
| "License"); you may not use this file except in compliance |
| with the License. You may obtain a copy of the License at |
| |
| http://www.apache.org/licenses/LICENSE-2.0 |
| |
| Unless required by applicable law or agreed to in writing, |
| software distributed under the License is distributed on an |
| "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY |
| KIND, either express or implied. See the License for the |
| specific language governing permissions and limitations |
| under the License. |
| --> |
| <!DOCTYPE concept PUBLIC "-//OASIS//DTD DITA Concept//EN" "concept.dtd"> |
| <concept rev="2.0.0" id="analytic_functions"> |
| |
| <title>Impala Analytic Functions</title> |
| |
| <titlealts audience="PDF"> |
| |
| <navtitle>Analytic Functions</navtitle> |
| |
| </titlealts> |
| |
| <prolog> |
| <metadata> |
| <data name="Category" value="Impala"/> |
| <data name="Category" value="SQL"/> |
| <data name="Category" value="Impala Functions"/> |
| <data name="Category" value="Aggregate Functions"/> |
| <data name="Category" value="Analytic Functions"/> |
| <data name="Category" value="Data Analysts"/> |
| <data name="Category" value="Developers"/> |
| <data name="Category" value="Querying"/> |
| </metadata> |
| </prolog> |
| |
| <conbody> |
| |
| <p rev="2.0.0"> Analytic functions (also known as window functions) are a |
| special category of built-in functions. Like aggregate functions, they |
| examine the contents of multiple input rows to compute each output value. |
| However, rather than being limited to one result value per <codeph>GROUP |
| BY</codeph> group, they operate on <term>windows</term> where the input |
| rows are ordered and grouped using flexible conditions expressed through |
| an <codeph>OVER()</codeph> clause. </p> |
| |
| <p conref="../shared/impala_common.xml#common/added_in_20"/> |
| |
| <!-- |
| <p> |
| Analytic functions produce one output value for each input row, like scalar functions such as |
| <codeph>length()</codeph> or |
| <codeph>substr()</codeph>. |
| </p> |
| --> |
| |
| <p> |
| Some functions, such as <codeph>LAG()</codeph> and <codeph>RANK()</codeph>, can only be used in this analytic |
| context. Some aggregate functions do double duty: when you call the aggregation functions such as |
| <codeph>MAX()</codeph>, <codeph>SUM()</codeph>, <codeph>AVG()</codeph>, and so on with an |
| <codeph>OVER()</codeph> clause, they produce an output value for each row, based on computations across other |
| rows in the window. |
| </p> |
| |
| <p> |
| Although analytic functions often compute the same value you would see from an aggregate function in a |
| <codeph>GROUP BY</codeph> query, the analytic functions produce a value for each row in the result set rather |
| than a single value for each group. This flexibility lets you include additional columns in the |
| <codeph>SELECT</codeph> list, offering more opportunities for organizing and filtering the result set. |
| </p> |
| |
| <p> |
| Analytic function calls are only allowed in the <codeph>SELECT</codeph> list and in the outermost |
| <codeph>ORDER BY</codeph> clause of the query. During query processing, analytic functions are evaluated |
| after other query stages such as joins, <codeph>WHERE</codeph>, and <codeph>GROUP BY</codeph>, |
| </p> |
| |
| <!-- Oracle doesn't show examples until it gets to the actual functions, so let's follow that lead. |
| <p> |
| The following example shows a very simple call to <codeph>MAX()</codeph> in |
| an analytic context, and a similar query using a <codeph>GROUP BY</codeph> clause. |
| </p> |
| --> |
| |
| <!-- |
| This basic query could be represented either as an analytic |
| function call or an aggregation function call in a <codeph>GROUP BY</codeph> query. |
| For more elaborate kinds of computations, the flexibility of the analytic window |
| makes that the preferred option. |
| --> |
| |
| <!-- TK: construct sample data and fill in query results. --> |
| |
| <!-- Other DB docs don't necessarily include examples up at this level, only for the individual functions. |
| So maybe take these placeholders out entirely. |
| |
| <codeblock>SELECT year, month, max(degrees_c) OVER (PARTITION BY year) FROM historical_temps; |
| SELECT year, month, max(degrees_c) FROM historical_temps GROUP BY year; |
| </codeblock> |
| --> |
| |
| <p> |
| The rows that are part of each partition are analyzed by computations across an ordered or unordered set of |
| rows. For example, <codeph>COUNT()</codeph> and <codeph>SUM()</codeph> might be applied to all the rows in |
| the partition, in which case the order of analysis does not matter. The <codeph>ORDER BY</codeph> clause |
| might be used inside the <codeph>OVER()</codeph> clause to defines the ordering that applies to functions |
| such as <codeph>LAG()</codeph> and <codeph>FIRST_VALUE()</codeph>. |
| </p> |
| |
| <!-- TK: output needed here also. --> |
| |
| <!-- |
| <codeblock>SELECT year, month, max(degrees_c) OVER (PARTITION BY year ORDER BY MONTH DESC) FROM historical_temps; |
| </codeblock> |
| --> |
| |
| <p> |
| Analytic functions are frequently used in fields such as finance and science to provide trend, outlier, and |
| bucketed analysis for large data sets. You might also see the term <q>window functions</q> in database |
| literature, referring to the sequence of rows (the <q>window</q>) that the function call applies to, |
| particularly when the <codeph>OVER</codeph> clause includes a <codeph>ROWS</codeph> or <codeph>RANGE</codeph> |
| keyword. |
| </p> |
| |
| <p> |
| The following sections describe the analytic query clauses and the pure analytic functions provided by |
| Impala. For usage information about aggregate functions in an analytic context, see |
| <xref href="impala_aggregate_functions.xml#aggregate_functions"/>. |
| </p> |
| |
| <p outputclass="toc inpage"/> |
| |
| </conbody> |
| |
| <concept id="over"> |
| |
| <title>OVER Clause</title> |
| |
| <conbody> |
| |
| <p> |
| The <codeph>OVER</codeph> clause is required for calls to pure analytic functions such as |
| <codeph>LEAD()</codeph>, <codeph>RANK()</codeph>, and <codeph>FIRST_VALUE()</codeph>. When you include an |
| <codeph>OVER</codeph> clause with calls to aggregate functions such as <codeph>MAX()</codeph>, |
| <codeph>COUNT()</codeph>, or <codeph>SUM()</codeph>, they operate as analytic functions. |
| </p> |
| |
| <p conref="../shared/impala_common.xml#common/syntax_blurb"/> |
| |
| <codeblock>function(<varname>args</varname>) OVER([<varname>partition_by_clause</varname>] [<varname>order_by_clause</varname> [<varname>window_clause</varname>]]) |
| |
| partition_by_clause ::= PARTITION BY <varname>expr</varname> [, <varname>expr</varname> ...] |
| order_by_clause ::= ORDER BY <varname>expr</varname> [ASC | DESC] [NULLS FIRST | NULLS LAST] [, <varname>expr</varname> [ASC | DESC] [NULLS FIRST | NULLS LAST] ...] |
| window_clause: See <xref href="#window_clause">Window Clause</xref> |
| </codeblock> |
| |
| <p> |
| <b>PARTITION BY clause:</b> |
| </p> |
| |
| <p> |
| The <codeph>PARTITION BY</codeph> clause acts much like the <codeph>GROUP BY</codeph> clause in the |
| outermost block of a query. It divides the rows into groups containing identical values in one or more |
| columns. These logical groups are known as <term>partitions</term>. Throughout the discussion of analytic |
| functions, <q>partitions</q> refers to the groups produced by the <codeph>PARTITION BY</codeph> clause, not |
| to partitioned tables. However, note the following limitation that applies specifically to analytic function |
| calls involving partitioned tables. |
| </p> |
| |
| <p conref="../shared/impala_common.xml#common/analytic_partition_pruning_caveat"/> |
| |
| <p> |
| The sequence of results from an analytic function <q>resets</q> for each new partition in the result set. |
| That is, the set of preceding or following rows considered by the analytic function always come from a |
| single partition. Any <codeph>MAX()</codeph>, <codeph>SUM()</codeph>, <codeph>ROW_NUMBER()</codeph>, and so |
| on apply to each partition independently. Omit the <codeph>PARTITION BY</codeph> clause to apply the |
| analytic operation to all the rows in the table. |
| </p> |
| |
| <p> |
| <b>ORDER BY clause:</b> |
| </p> |
| |
| <p> |
| The <codeph>ORDER BY</codeph> clause works much like the <codeph>ORDER BY</codeph> clause in the outermost |
| block of a query. It defines the order in which rows are evaluated for the entire input set, or for each |
| group produced by a <codeph>PARTITION BY</codeph> clause. You can order by one or multiple expressions, and |
| for each expression optionally choose ascending or descending order and whether nulls come first or last in |
| the sort order. Because this <codeph>ORDER BY</codeph> clause only defines the order in which rows are |
| evaluated, if you want the results to be output in a specific order, also include an <codeph>ORDER |
| BY</codeph> clause in the outer block of the query. |
| </p> |
| |
| <p> |
| When the <codeph>ORDER BY</codeph> clause is omitted, the analytic function applies to all items in the |
| group produced by the <codeph>PARTITION BY</codeph> clause. When the <codeph>ORDER BY</codeph> clause is |
| included, the analysis can apply to all or a subset of the items in the group, depending on the optional |
| window clause. |
| </p> |
| |
| <p> |
| The order in which the rows are analyzed is only defined for those columns specified in <codeph>ORDER |
| BY</codeph> clauses. |
| </p> |
| |
| <p> |
| One difference between the analytic and outer uses of the <codeph>ORDER BY</codeph> clause: inside the |
| <codeph>OVER</codeph> clause, <codeph>ORDER BY 1</codeph> or other integer value is interpreted as a |
| constant sort value (effectively a no-op) rather than referring to column 1. |
| </p> |
| |
| <p> |
| <b>Window clause:</b> |
| </p> |
| |
| <p> |
| The window clause is only allowed in combination with an <codeph>ORDER BY</codeph> clause. If the |
| <codeph>ORDER BY</codeph> clause is specified but the window clause is not, the default window is |
| <codeph>RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW</codeph>. See |
| <xref href="impala_analytic_functions.xml#window_clause"/> for full details. |
| </p> |
| |
| <p conref="../shared/impala_common.xml#common/hbase_blurb"/> |
| |
| <p> |
| Because HBase tables are optimized for single-row lookups rather than full scans, analytic functions using |
| the <codeph>OVER()</codeph> clause are not recommended for HBase tables. Although such queries work, their |
| performance is lower than on comparable tables using HDFS data files. |
| </p> |
| |
| <p conref="../shared/impala_common.xml#common/parquet_blurb"/> |
| |
| <p> |
| Analytic functions are very efficient for Parquet tables. The data that is examined during evaluation of |
| the <codeph>OVER()</codeph> clause comes from a specified set of columns, and the values for each column |
| are arranged sequentially within each data file. |
| </p> |
| |
| <p conref="../shared/impala_common.xml#common/text_blurb"/> |
| |
| <p> |
| Analytic functions are convenient to use with text tables for exploratory business intelligence. When the |
| volume of data is substantial, prefer to use Parquet tables for performance-critical analytic queries. |
| </p> |
| |
| <p conref="../shared/impala_common.xml#common/added_in_20"/> |
| |
| <p conref="../shared/impala_common.xml#common/example_blurb"/> |
| |
| <p> |
| The following example shows how to synthesize a numeric sequence corresponding to all the rows in a table. |
| The new table has the same columns as the old one, plus an additional column <codeph>ID</codeph> containing |
| the integers 1, 2, 3, and so on, corresponding to the order of a <codeph>TIMESTAMP</codeph> column in the |
| original table. |
| </p> |
| |
| <!-- TK: synthesize some data and fill in output here. --> |
| |
| <codeblock>CREATE TABLE events_with_id AS |
| SELECT |
| row_number() OVER (ORDER BY date_and_time) AS id, |
| c1, c2, c3, c4 |
| FROM events; |
| </codeblock> |
| |
| <p> |
| The following example shows how to determine the number of rows containing each value for a column. Unlike |
| a corresponding <codeph>GROUP BY</codeph> query, this one can analyze a single column and still return all |
| values (not just the distinct ones) from the other columns. |
| </p> |
| |
| <!-- TK: verify the 'unbounded' shortcut syntax. --> |
| |
| <codeblock>SELECT x, y, z, |
| count() OVER (PARTITION BY x) AS how_many_x |
| FROM t1; |
| </codeblock> |
| |
| <p conref="../shared/impala_common.xml#common/restrictions_blurb"/> |
| |
| <p> |
| You cannot directly combine the <codeph>DISTINCT</codeph> operator with analytic function calls. You can |
| put the analytic function call in a <codeph>WITH</codeph> clause or an inline view, and apply the |
| <codeph>DISTINCT</codeph> operator to its result set. |
| </p> |
| |
| <codeblock>WITH t1 AS (SELECT x, sum(x) OVER (PARTITION BY x) AS total FROM t1) |
| SELECT DISTINCT x, total FROM t1; |
| </codeblock> |
| |
| </conbody> |
| |
| </concept> |
| |
| <concept id="window_clause"> |
| |
| <title>Window Clause</title> |
| |
| <conbody> |
| |
| <p> |
| Certain analytic functions accept an optional <term>window clause</term>, which makes the function analyze |
| only certain rows <q>around</q> the current row rather than all rows in the partition. For example, you can |
| get a moving average by specifying some number of preceding and following rows, or a running count or |
| running total by specifying all rows up to the current position. This clause can result in different |
| analytic results for rows within the same partition. |
| </p> |
| |
| <p> |
| The window clause is supported with the <codeph>AVG()</codeph>, <codeph>COUNT()</codeph>, |
| <codeph>FIRST_VALUE()</codeph>, <codeph>LAST_VALUE()</codeph>, and <codeph>SUM()</codeph> functions. |
| <!-- To do: fill in this factoid under MAX and MIN also. --> |
| For <codeph>MAX()</codeph> and <codeph>MIN()</codeph>, the window clause only allowed if the start bound is |
| <codeph>UNBOUNDED PRECEDING</codeph> |
| </p> |
| |
| <p conref="../shared/impala_common.xml#common/syntax_blurb"/> |
| |
| <codeblock>ROWS BETWEEN [ { <varname>m</varname> | UNBOUNDED } PRECEDING | CURRENT ROW] [ AND [CURRENT ROW | { UNBOUNDED | <varname>n</varname> } FOLLOWING] ] |
| RANGE BETWEEN [ {<varname>m</varname> | UNBOUNDED } PRECEDING | CURRENT ROW] [ AND [CURRENT ROW | { UNBOUNDED | <varname>n</varname> } FOLLOWING] ]</codeblock> |
| |
| <p> |
| <codeph>ROWS BETWEEN</codeph> defines the size of the window in terms of the indexes of the rows in the |
| result set. The size of the window is predictable based on the clauses the position within the result set. |
| </p> |
| |
| <p> |
| <codeph>RANGE BETWEEN</codeph> does not currently support numeric arguments to define a variable-size |
| sliding window. |
| <!-- |
| Currently, it effectively works the same as the |
| equivalent <codeph>ROWS BETWEEN</codeph> clause. |
| --> |
| </p> |
| |
| <!-- |
| <p> |
| <codeph>RANGE BETWEEN</codeph> defines the size of the window based on arithmetic comparisons |
| of the values in the result set. |
| The size of the window varies depending on the order and distribution of values. |
| </p> |
| --> |
| |
| <p> |
| Currently, Impala supports only some combinations of arguments to the <codeph>RANGE</codeph> clause: |
| </p> |
| |
| <ul> |
| <li> |
| <codeph>RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW</codeph> (the default when <codeph>ORDER |
| BY</codeph> is specified and the window clause is omitted) |
| </li> |
| |
| <li> |
| <codeph>RANGE BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING</codeph> |
| </li> |
| |
| <li> |
| <codeph>RANGE BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING</codeph> |
| </li> |
| </ul> |
| |
| <p> |
| When <codeph>RANGE</codeph> is used, <codeph>CURRENT ROW</codeph> includes not just the current row but all |
| rows that are tied with the current row based on the <codeph>ORDER BY</codeph> expressions. |
| </p> |
| |
| <p conref="../shared/impala_common.xml#common/added_in_20"/> |
| |
| <p conref="../shared/impala_common.xml#common/example_blurb"/> |
| |
| <p> |
| The following examples show financial data for a fictional stock symbol <codeph>JDR</codeph>. The closing |
| price moves up and down each day. |
| </p> |
| |
| <codeblock>create table stock_ticker (stock_symbol string, closing_price decimal(8,2), closing_date timestamp); |
| ...load some data... |
| select * from stock_ticker order by stock_symbol, closing_date |
| +--------------+---------------+---------------------+ |
| | stock_symbol | closing_price | closing_date | |
| +--------------+---------------+---------------------+ |
| | JDR | 12.86 | 2014-10-02 00:00:00 | |
| | JDR | 12.89 | 2014-10-03 00:00:00 | |
| | JDR | 12.94 | 2014-10-04 00:00:00 | |
| | JDR | 12.55 | 2014-10-05 00:00:00 | |
| | JDR | 14.03 | 2014-10-06 00:00:00 | |
| | JDR | 14.75 | 2014-10-07 00:00:00 | |
| | JDR | 13.98 | 2014-10-08 00:00:00 | |
| +--------------+---------------+---------------------+ |
| </codeblock> |
| |
| <p> |
| The queries use analytic functions with window clauses to compute moving averages of the closing price. For |
| example, <codeph>ROWS BETWEEN 1 PRECEDING AND 1 FOLLOWING</codeph> produces an average of the value from a |
| 3-day span, producing a different value for each row. The first row, which has no preceding row, only gets |
| averaged with the row following it. If the table contained more than one stock symbol, the |
| <codeph>PARTITION BY</codeph> clause would limit the window for the moving average to only consider the |
| prices for a single stock. |
| </p> |
| |
| <codeblock>select stock_symbol, closing_date, closing_price, |
| avg(closing_price) over (partition by stock_symbol order by closing_date |
| rows between 1 preceding and 1 following) as moving_average |
| from stock_ticker; |
| +--------------+---------------------+---------------+----------------+ |
| | stock_symbol | closing_date | closing_price | moving_average | |
| +--------------+---------------------+---------------+----------------+ |
| | JDR | 2014-10-02 00:00:00 | 12.86 | 12.87 | |
| | JDR | 2014-10-03 00:00:00 | 12.89 | 12.89 | |
| | JDR | 2014-10-04 00:00:00 | 12.94 | 12.79 | |
| | JDR | 2014-10-05 00:00:00 | 12.55 | 13.17 | |
| | JDR | 2014-10-06 00:00:00 | 14.03 | 13.77 | |
| | JDR | 2014-10-07 00:00:00 | 14.75 | 14.25 | |
| | JDR | 2014-10-08 00:00:00 | 13.98 | 14.36 | |
| +--------------+---------------------+---------------+----------------+ |
| </codeblock> |
| |
| <p> |
| The clause <codeph>ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW</codeph> produces a cumulative moving |
| average, from the earliest data up to the value for each day. |
| </p> |
| |
| <codeblock>select stock_symbol, closing_date, closing_price, |
| avg(closing_price) over (partition by stock_symbol order by closing_date |
| rows between unbounded preceding and current row) as moving_average |
| from stock_ticker; |
| +--------------+---------------------+---------------+----------------+ |
| | stock_symbol | closing_date | closing_price | moving_average | |
| +--------------+---------------------+---------------+----------------+ |
| | JDR | 2014-10-02 00:00:00 | 12.86 | 12.86 | |
| | JDR | 2014-10-03 00:00:00 | 12.89 | 12.87 | |
| | JDR | 2014-10-04 00:00:00 | 12.94 | 12.89 | |
| | JDR | 2014-10-05 00:00:00 | 12.55 | 12.81 | |
| | JDR | 2014-10-06 00:00:00 | 14.03 | 13.05 | |
| | JDR | 2014-10-07 00:00:00 | 14.75 | 13.33 | |
| | JDR | 2014-10-08 00:00:00 | 13.98 | 13.42 | |
| +--------------+---------------------+---------------+----------------+ |
| </codeblock> |
| |
| <!-- Matt suggests not always true depending on data. Hiding until I can try myself. |
| <p> |
| The clause <codeph>RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW</codeph> would produce the same |
| output as above. Because <codeph>RANGE</codeph> currently does not support numeric offsets while |
| <codeph>ROWS</codeph> does, currently the <codeph>ROWS</codeph> syntax is more flexible. |
| </p> |
| --> |
| |
| </conbody> |
| |
| </concept> |
| |
| <concept id="avg_analytic"> |
| |
| <title>AVG Function - Analytic Context</title> |
| |
| <conbody> |
| |
| <p> |
| You can include an <codeph>OVER</codeph> clause with a call to this function to use it as an analytic |
| function. See <xref href="impala_avg.xml#avg"/> for details and examples. |
| </p> |
| |
| </conbody> |
| |
| </concept> |
| |
| <concept id="count_analytic"> |
| |
| <title>COUNT Function - Analytic Context</title> |
| |
| <conbody> |
| |
| <p> |
| You can include an <codeph>OVER</codeph> clause with a call to this function to use it as an analytic |
| function. See <xref href="impala_count.xml#count"/> for details and examples. |
| </p> |
| |
| </conbody> |
| |
| </concept> |
| |
| <concept rev="2.3.0" id="cume_dist"> |
| |
| <title>CUME_DIST Function (<keyword keyref="impala23"/> or higher only)</title> |
| |
| <conbody> |
| |
| <p> |
| Returns the cumulative distribution of a value. The value for each row in the result set is greater than 0 |
| and less than or equal to 1. |
| </p> |
| |
| <p conref="../shared/impala_common.xml#common/syntax_blurb"/> |
| |
| <codeblock>CUME_DIST (<varname>expr</varname>) |
| OVER ([<varname>partition_by_clause</varname>] <varname>order_by_clause</varname>) |
| </codeblock> |
| |
| <p> |
| The <codeph>ORDER BY</codeph> clause is required. The <codeph>PARTITION BY</codeph> clause is optional. The |
| window clause is not allowed. |
| </p> |
| |
| <p conref="../shared/impala_common.xml#common/usage_notes_blurb"/> |
| |
| <p> |
| Within each partition of the result set, the <codeph>CUME_DIST()</codeph> value represents an ascending |
| sequence that ends at 1. Each value represents the proportion of rows in the partition whose values are |
| less than or equal to the value in the current row. |
| </p> |
| |
| <p> |
| If the sequence of input values contains ties, the <codeph>CUME_DIST()</codeph> results are identical for the |
| tied values. |
| </p> |
| |
| <p> |
| Impala only supports the <codeph>CUME_DIST()</codeph> function in an analytic context, not as a regular |
| aggregate function. |
| </p> |
| |
| <p conref="../shared/impala_common.xml#common/example_blurb"/> |
| |
| <p> |
| This example uses a table with 9 rows. The <codeph>CUME_DIST()</codeph> |
| function evaluates the entire table because there is no <codeph>PARTITION BY</codeph> clause, |
| with the rows ordered by the weight of the animal. |
| the sequence of values shows that 1/9 of the values are less than or equal to the lightest |
| animal (mouse), 2/9 of the values are less than or equal to the second-lightest animal, |
| and so on up to the heaviest animal (elephant), where 9/9 of the rows are less than or |
| equal to its weight. |
| </p> |
| |
| <codeblock>create table animals (name string, kind string, kilos decimal(9,3)); |
| insert into animals values |
| ('Elephant', 'Mammal', 4000), ('Giraffe', 'Mammal', 1200), ('Mouse', 'Mammal', 0.020), |
| ('Condor', 'Bird', 15), ('Horse', 'Mammal', 500), ('Owl', 'Bird', 2.5), |
| ('Ostrich', 'Bird', 145), ('Polar bear', 'Mammal', 700), ('Housecat', 'Mammal', 5); |
| |
| select name, cume_dist() over (order by kilos) from animals; |
| +------------+-----------------------+ |
| | name | cume_dist() OVER(...) | |
| +------------+-----------------------+ |
| | Elephant | 1 | |
| | Giraffe | 0.8888888888888888 | |
| | Polar bear | 0.7777777777777778 | |
| | Horse | 0.6666666666666666 | |
| | Ostrich | 0.5555555555555556 | |
| | Condor | 0.4444444444444444 | |
| | Housecat | 0.3333333333333333 | |
| | Owl | 0.2222222222222222 | |
| | Mouse | 0.1111111111111111 | |
| +------------+-----------------------+ |
| </codeblock> |
| |
| <p> |
| Using a <codeph>PARTITION BY</codeph> clause produces a separate sequence for each partition |
| group, in this case one for mammals and one for birds. Because there are 3 birds and 6 mammals, |
| the sequence illustrates how 1/3 of the <q>Bird</q> rows have a <codeph>kilos</codeph> value that is less than or equal to |
| the lightest bird, 1/6 of the <q>Mammal</q> rows have a <codeph>kilos</codeph> value that is less than or equal to |
| the lightest mammal, and so on until both the heaviest bird and heaviest mammal have a <codeph>CUME_DIST()</codeph> |
| value of 1. |
| </p> |
| |
| <codeblock>select name, kind, cume_dist() over (partition by kind order by kilos) from animals |
| +------------+--------+-----------------------+ |
| | name | kind | cume_dist() OVER(...) | |
| +------------+--------+-----------------------+ |
| | Ostrich | Bird | 1 | |
| | Condor | Bird | 0.6666666666666666 | |
| | Owl | Bird | 0.3333333333333333 | |
| | Elephant | Mammal | 1 | |
| | Giraffe | Mammal | 0.8333333333333334 | |
| | Polar bear | Mammal | 0.6666666666666666 | |
| | Horse | Mammal | 0.5 | |
| | Housecat | Mammal | 0.3333333333333333 | |
| | Mouse | Mammal | 0.1666666666666667 | |
| +------------+--------+-----------------------+ |
| </codeblock> |
| |
| <p> |
| We can reverse the ordering within each partition group by using an <codeph>ORDER BY ... DESC</codeph> |
| clause within the <codeph>OVER()</codeph> clause. Now the lightest (smallest value of <codeph>kilos</codeph>) |
| animal of each kind has a <codeph>CUME_DIST()</codeph> value of 1. |
| </p> |
| |
| <codeblock>select name, kind, cume_dist() over (partition by kind order by kilos desc) from animals |
| +------------+--------+-----------------------+ |
| | name | kind | cume_dist() OVER(...) | |
| +------------+--------+-----------------------+ |
| | Owl | Bird | 1 | |
| | Condor | Bird | 0.6666666666666666 | |
| | Ostrich | Bird | 0.3333333333333333 | |
| | Mouse | Mammal | 1 | |
| | Housecat | Mammal | 0.8333333333333334 | |
| | Horse | Mammal | 0.6666666666666666 | |
| | Polar bear | Mammal | 0.5 | |
| | Giraffe | Mammal | 0.3333333333333333 | |
| | Elephant | Mammal | 0.1666666666666667 | |
| +------------+--------+-----------------------+ |
| </codeblock> |
| |
| <p> |
| The following example manufactures some rows with identical values in the <codeph>kilos</codeph> column, |
| to demonstrate how the results look in case of tie values. For simplicity, it only shows the <codeph>CUME_DIST()</codeph> |
| sequence for the <q>Bird</q> rows. Now with 3 rows all with a value of 15, all of those rows have the same |
| <codeph>CUME_DIST()</codeph> value. 4/5 of the rows have a value for <codeph>kilos</codeph> that is less than or |
| equal to 15. |
| </p> |
| |
| <codeblock>insert into animals values ('California Condor', 'Bird', 15), ('Andean Condor', 'Bird', 15) |
| |
| select name, kind, cume_dist() over (order by kilos) from animals where kind = 'Bird'; |
| +-------------------+------+-----------------------+ |
| | name | kind | cume_dist() OVER(...) | |
| +-------------------+------+-----------------------+ |
| | Ostrich | Bird | 1 | |
| | Condor | Bird | 0.8 | |
| | California Condor | Bird | 0.8 | |
| | Andean Condor | Bird | 0.8 | |
| | Owl | Bird | 0.2 | |
| +-------------------+------+-----------------------+ |
| </codeblock> |
| |
| <p> |
| The following example shows how to use an <codeph>ORDER BY</codeph> clause in the outer block |
| to order the result set in case of ties. Here, all the <q>Bird</q> rows are together, then in descending order |
| by the result of the <codeph>CUME_DIST()</codeph> function, and all tied <codeph>CUME_DIST()</codeph> |
| values are ordered by the animal name. |
| </p> |
| |
| <codeblock>select name, kind, cume_dist() over (partition by kind order by kilos) as ordering |
| from animals |
| where |
| kind = 'Bird' |
| order by kind, ordering desc, name; |
| +-------------------+------+----------+ |
| | name | kind | ordering | |
| +-------------------+------+----------+ |
| | Ostrich | Bird | 1 | |
| | Andean Condor | Bird | 0.8 | |
| | California Condor | Bird | 0.8 | |
| | Condor | Bird | 0.8 | |
| | Owl | Bird | 0.2 | |
| +-------------------+------+----------+ |
| </codeblock> |
| |
| </conbody> |
| |
| </concept> |
| |
| <concept rev="2.0.0" id="dense_rank"> |
| |
| <title>DENSE_RANK Function</title> |
| |
| <conbody> |
| |
| <p> |
| Returns an ascending sequence of integers, starting with 1. The output sequence produces duplicate integers |
| for duplicate values of the <codeph>ORDER BY</codeph> expressions. After generating duplicate output values |
| for the <q>tied</q> input values, the function continues the sequence with the next higher integer. |
| Therefore, the sequence contains duplicates but no gaps when the input contains duplicates. Starts the |
| sequence over for each group produced by the <codeph>PARTITIONED BY</codeph> clause. |
| </p> |
| |
| <p conref="../shared/impala_common.xml#common/syntax_blurb"/> |
| |
| <codeblock>DENSE_RANK() OVER([<varname>partition_by_clause</varname>] <varname>order_by_clause</varname>)</codeblock> |
| |
| <p> |
| The <codeph>PARTITION BY</codeph> clause is optional. The <codeph>ORDER BY</codeph> clause is required. The |
| window clause is not allowed. |
| </p> |
| |
| <!-- Can make the text for ROW_NUMBER, RANK, and DENSE_RANK identical |
| so it can be conref'ed in all 3 places. --> |
| |
| <p conref="../shared/impala_common.xml#common/usage_notes_blurb"/> |
| |
| <p> |
| Often used for top-N and bottom-N queries. For example, it could produce a <q>top 10</q> report including |
| all the items with the 10 highest values, even if several items tied for 1st place. |
| </p> |
| |
| <p> |
| Similar to <codeph>ROW_NUMBER</codeph> and <codeph>RANK</codeph>. These functions differ in how they treat |
| duplicate combinations of values. |
| </p> |
| |
| <p conref="../shared/impala_common.xml#common/added_in_20"/> |
| |
| <p conref="../shared/impala_common.xml#common/example_blurb"/> |
| |
| <p> |
| The following example demonstrates how the <codeph>DENSE_RANK()</codeph> function identifies where each |
| value <q>places</q> in the result set, producing the same result for duplicate values, but with a strict |
| sequence from 1 to the number of groups. For example, when results are ordered by the <codeph>X</codeph> |
| column, both <codeph>1</codeph> values are tied for first; both <codeph>2</codeph> values are tied for |
| second; and so on. |
| </p> |
| |
| <codeblock>select x, dense_rank() over(order by x) as rank, property from int_t; |
| +----+------+----------+ |
| | x | rank | property | |
| +----+------+----------+ |
| | 1 | 1 | square | |
| | 1 | 1 | odd | |
| | 2 | 2 | even | |
| | 2 | 2 | prime | |
| | 3 | 3 | prime | |
| | 3 | 3 | odd | |
| | 4 | 4 | even | |
| | 4 | 4 | square | |
| | 5 | 5 | odd | |
| | 5 | 5 | prime | |
| | 6 | 6 | even | |
| | 6 | 6 | perfect | |
| | 7 | 7 | lucky | |
| | 7 | 7 | lucky | |
| | 7 | 7 | lucky | |
| | 7 | 7 | odd | |
| | 7 | 7 | prime | |
| | 8 | 8 | even | |
| | 9 | 9 | square | |
| | 9 | 9 | odd | |
| | 10 | 10 | round | |
| | 10 | 10 | even | |
| +----+------+----------+ |
| </codeblock> |
| |
| <p> |
| The following examples show how the <codeph>DENSE_RANK()</codeph> function is affected by the |
| <codeph>PARTITION</codeph> property within the <codeph>ORDER BY</codeph> clause. |
| </p> |
| |
| <p> |
| Partitioning by the <codeph>PROPERTY</codeph> column groups all the even, odd, and so on values together, |
| and <codeph>DENSE_RANK()</codeph> returns the place of each value within the group, producing several |
| ascending sequences. |
| </p> |
| |
| <codeblock>select x, dense_rank() over(partition by property order by x) as rank, property from int_t; |
| +----+------+----------+ |
| | x | rank | property | |
| +----+------+----------+ |
| | 2 | 1 | even | |
| | 4 | 2 | even | |
| | 6 | 3 | even | |
| | 8 | 4 | even | |
| | 10 | 5 | even | |
| | 7 | 1 | lucky | |
| | 7 | 1 | lucky | |
| | 7 | 1 | lucky | |
| | 1 | 1 | odd | |
| | 3 | 2 | odd | |
| | 5 | 3 | odd | |
| | 7 | 4 | odd | |
| | 9 | 5 | odd | |
| | 6 | 1 | perfect | |
| | 2 | 1 | prime | |
| | 3 | 2 | prime | |
| | 5 | 3 | prime | |
| | 7 | 4 | prime | |
| | 10 | 1 | round | |
| | 1 | 1 | square | |
| | 4 | 2 | square | |
| | 9 | 3 | square | |
| +----+------+----------+ |
| </codeblock> |
| |
| <p> |
| Partitioning by the <codeph>X</codeph> column groups all the duplicate numbers together and returns the |
| place each value within the group; because each value occurs only 1 or 2 times, |
| <codeph>DENSE_RANK()</codeph> designates each <codeph>X</codeph> value as either first or second within its |
| group. |
| </p> |
| |
| <codeblock>select x, dense_rank() over(partition by x order by property) as rank, property from int_t; |
| +----+------+----------+ |
| | x | rank | property | |
| +----+------+----------+ |
| | 1 | 1 | odd | |
| | 1 | 2 | square | |
| | 2 | 1 | even | |
| | 2 | 2 | prime | |
| | 3 | 1 | odd | |
| | 3 | 2 | prime | |
| | 4 | 1 | even | |
| | 4 | 2 | square | |
| | 5 | 1 | odd | |
| | 5 | 2 | prime | |
| | 6 | 1 | even | |
| | 6 | 2 | perfect | |
| | 7 | 1 | lucky | |
| | 7 | 1 | lucky | |
| | 7 | 1 | lucky | |
| | 7 | 2 | odd | |
| | 7 | 3 | prime | |
| | 8 | 1 | even | |
| | 9 | 1 | odd | |
| | 9 | 2 | square | |
| | 10 | 1 | even | |
| | 10 | 2 | round | |
| +----+------+----------+ |
| </codeblock> |
| |
| <p> |
| The following example shows how <codeph>DENSE_RANK()</codeph> produces a continuous sequence while still |
| allowing for ties. In this case, Croesus and Midas both have the second largest fortune, while Crassus has |
| the third largest. (In <xref href="impala_analytic_functions.xml#rank"/>, you see a similar query with the |
| <codeph>RANK()</codeph> function that shows that while Crassus has the third largest fortune, he is the |
| fourth richest person.) |
| </p> |
| |
| <codeblock>select dense_rank() over (order by net_worth desc) as placement, name, net_worth from wealth order by placement, name; |
| +-----------+---------+---------------+ |
| | placement | name | net_worth | |
| +-----------+---------+---------------+ |
| | 1 | Solomon | 2000000000.00 | |
| | 2 | Croesus | 1000000000.00 | |
| | 2 | Midas | 1000000000.00 | |
| | 3 | Crassus | 500000000.00 | |
| | 4 | Scrooge | 80000000.00 | |
| +-----------+---------+---------------+ |
| </codeblock> |
| |
| <p conref="../shared/impala_common.xml#common/related_info"/> |
| |
| <p> |
| <xref href="impala_analytic_functions.xml#rank"/>, <xref href="impala_analytic_functions.xml#row_number"/> |
| </p> |
| |
| </conbody> |
| |
| </concept> |
| |
| <concept rev="2.0.0" id="first_value"> |
| |
| <title>FIRST_VALUE Function</title> |
| |
| <conbody> |
| |
| <p> |
| Returns the expression value from the first row in the window. The return value is <codeph>NULL</codeph> if |
| the input expression is <codeph>NULL</codeph>. |
| </p> |
| |
| <p conref="../shared/impala_common.xml#common/syntax_blurb"/> |
| |
| <codeblock>FIRST_VALUE(<varname>expr</varname>) OVER([<varname>partition_by_clause</varname>] <varname>order_by_clause</varname> [<varname>window_clause</varname>])</codeblock> |
| |
| <p> |
| The <codeph>PARTITION BY</codeph> clause is optional. The <codeph>ORDER BY</codeph> clause is required. The |
| window clause is optional. |
| </p> |
| |
| <p conref="../shared/impala_common.xml#common/usage_notes_blurb"/> |
| |
| <p> |
| If any duplicate values occur in the tuples evaluated by the <codeph>ORDER BY</codeph> clause, the result |
| of this function is not deterministic. Consider adding additional <codeph>ORDER BY</codeph> columns to |
| ensure consistent ordering. |
| </p> |
| |
| <p conref="../shared/impala_common.xml#common/added_in_20"/> |
| |
| <p conref="../shared/impala_common.xml#common/example_blurb"/> |
| |
| <p> |
| The following example shows a table with a wide variety of country-appropriate greetings. For consistency, |
| we want to standardize on a single greeting for each country. The <codeph>FIRST_VALUE()</codeph> function |
| helps to produce a mail merge report where every person from the same country is addressed with the same |
| greeting. |
| </p> |
| |
| <codeblock>select name, country, greeting from mail_merge |
| +---------+---------+--------------+ |
| | name | country | greeting | |
| +---------+---------+--------------+ |
| | Pete | USA | Hello | |
| | John | USA | Hi | |
| | Boris | Germany | Guten tag | |
| | Michael | Germany | Guten morgen | |
| | Bjorn | Sweden | Hej | |
| | Mats | Sweden | Tja | |
| +---------+---------+--------------+ |
| |
| select country, name, |
| first_value(greeting) |
| over (partition by country order by name, greeting) as greeting |
| from mail_merge; |
| +---------+---------+-----------+ |
| | country | name | greeting | |
| +---------+---------+-----------+ |
| | Germany | Boris | Guten tag | |
| | Germany | Michael | Guten tag | |
| | Sweden | Bjorn | Hej | |
| | Sweden | Mats | Hej | |
| | USA | John | Hi | |
| | USA | Pete | Hi | |
| +---------+---------+-----------+ |
| </codeblock> |
| |
| <p> |
| Changing the order in which the names are evaluated changes which greeting is applied to each group. |
| </p> |
| |
| <codeblock>select country, name, |
| first_value(greeting) |
| over (partition by country order by name desc, greeting) as greeting |
| from mail_merge; |
| +---------+---------+--------------+ |
| | country | name | greeting | |
| +---------+---------+--------------+ |
| | Germany | Michael | Guten morgen | |
| | Germany | Boris | Guten morgen | |
| | Sweden | Mats | Tja | |
| | Sweden | Bjorn | Tja | |
| | USA | Pete | Hello | |
| | USA | John | Hello | |
| +---------+---------+--------------+ |
| </codeblock> |
| |
| <p conref="../shared/impala_common.xml#common/related_info"/> |
| |
| <p> |
| <xref href="impala_analytic_functions.xml#last_value"/> |
| </p> |
| |
| </conbody> |
| |
| </concept> |
| |
| <concept rev="2.0.0" id="lag"> |
| |
| <title>LAG Function</title> |
| |
| <conbody> |
| |
| <p> |
| This function returns the value of an expression using column values from a preceding row. You specify an |
| integer offset, which designates a row position some number of rows previous to the current row. Any column |
| references in the expression argument refer to column values from that prior row. Typically, the table |
| contains a time sequence or numeric sequence column that clearly distinguishes the ordering of the rows. |
| </p> |
| |
| <p conref="../shared/impala_common.xml#common/syntax_blurb"/> |
| |
| <codeblock>LAG (<varname>expr</varname> [, <varname>offset</varname>] [, <varname>default</varname>]) |
| OVER ([<varname>partition_by_clause</varname>] <varname>order_by_clause</varname>)</codeblock> |
| |
| <p> |
| The <codeph>ORDER BY</codeph> clause is required. The <codeph>PARTITION BY</codeph> clause is optional. The |
| window clause is not allowed. |
| </p> |
| |
| <p conref="../shared/impala_common.xml#common/usage_notes_blurb"/> |
| |
| <p> |
| Sometimes used an an alternative to doing a self-join. |
| </p> |
| |
| <p conref="../shared/impala_common.xml#common/added_in_20"/> |
| |
| <p conref="../shared/impala_common.xml#common/example_blurb"/> |
| |
| <p> |
| The following example uses the same stock data created in <xref href="#window_clause"/>. For each day, the |
| query prints the closing price alongside the previous day's closing price. The first row for each stock |
| symbol has no previous row, so that <codeph>LAG()</codeph> value is <codeph>NULL</codeph>. |
| </p> |
| |
| <codeblock>select stock_symbol, closing_date, closing_price, |
| lag(closing_price,1) over (partition by stock_symbol order by closing_date) as "yesterday closing" |
| from stock_ticker |
| order by closing_date; |
| +--------------+---------------------+---------------+-------------------+ |
| | stock_symbol | closing_date | closing_price | yesterday closing | |
| +--------------+---------------------+---------------+-------------------+ |
| | JDR | 2014-09-13 00:00:00 | 12.86 | NULL | |
| | JDR | 2014-09-14 00:00:00 | 12.89 | 12.86 | |
| | JDR | 2014-09-15 00:00:00 | 12.94 | 12.89 | |
| | JDR | 2014-09-16 00:00:00 | 12.55 | 12.94 | |
| | JDR | 2014-09-17 00:00:00 | 14.03 | 12.55 | |
| | JDR | 2014-09-18 00:00:00 | 14.75 | 14.03 | |
| | JDR | 2014-09-19 00:00:00 | 13.98 | 14.75 | |
| +--------------+---------------------+---------------+-------------------+ |
| </codeblock> |
| |
| <p> |
| The following example does an arithmetic operation between the current row and a value from the previous |
| row, to produce a delta value for each day. This example also demonstrates how <codeph>ORDER BY</codeph> |
| works independently in the different parts of the query. The <codeph>ORDER BY closing_date</codeph> in the |
| <codeph>OVER</codeph> clause makes the query analyze the rows in chronological order. Then the outer query |
| block uses <codeph>ORDER BY closing_date DESC</codeph> to present the results with the most recent date |
| first. |
| </p> |
| |
| <codeblock>select stock_symbol, closing_date, closing_price, |
| cast( |
| closing_price - lag(closing_price,1) over |
| (partition by stock_symbol order by closing_date) |
| as decimal(8,2) |
| ) |
| as "change from yesterday" |
| from stock_ticker |
| order by closing_date desc; |
| +--------------+---------------------+---------------+-----------------------+ |
| | stock_symbol | closing_date | closing_price | change from yesterday | |
| +--------------+---------------------+---------------+-----------------------+ |
| | JDR | 2014-09-19 00:00:00 | 13.98 | -0.76 | |
| | JDR | 2014-09-18 00:00:00 | 14.75 | 0.72 | |
| | JDR | 2014-09-17 00:00:00 | 14.03 | 1.47 | |
| | JDR | 2014-09-16 00:00:00 | 12.55 | -0.38 | |
| | JDR | 2014-09-15 00:00:00 | 12.94 | 0.04 | |
| | JDR | 2014-09-14 00:00:00 | 12.89 | 0.03 | |
| | JDR | 2014-09-13 00:00:00 | 12.86 | NULL | |
| +--------------+---------------------+---------------+-----------------------+ |
| </codeblock> |
| |
| <p conref="../shared/impala_common.xml#common/related_info"/> |
| |
| <p> |
| This function is the converse of <xref href="impala_analytic_functions.xml#lead"/>. |
| </p> |
| |
| </conbody> |
| |
| </concept> |
| |
| <concept rev="2.0.0" id="last_value"> |
| |
| <title>LAST_VALUE Function</title> |
| |
| <conbody> |
| |
| <p> |
| Returns the expression value from the last row in the window. This same value is repeated for all result |
| rows for the group. The return value is <codeph>NULL</codeph> if the input expression is |
| <codeph>NULL</codeph>. |
| </p> |
| |
| <p conref="../shared/impala_common.xml#common/syntax_blurb"/> |
| |
| <codeblock>LAST_VALUE(<varname>expr</varname>) OVER([<varname>partition_by_clause</varname>] <varname>order_by_clause</varname> [<varname>window_clause</varname>])</codeblock> |
| |
| <p> |
| The <codeph>PARTITION BY</codeph> clause is optional. The <codeph>ORDER BY</codeph> clause is required. The |
| window clause is optional. |
| </p> |
| |
| <p conref="../shared/impala_common.xml#common/usage_notes_blurb"/> |
| |
| <p> |
| If any duplicate values occur in the tuples evaluated by the <codeph>ORDER BY</codeph> clause, the result |
| of this function is not deterministic. Consider adding additional <codeph>ORDER BY</codeph> columns to |
| ensure consistent ordering. |
| </p> |
| |
| <p conref="../shared/impala_common.xml#common/added_in_20"/> |
| |
| <p conref="../shared/impala_common.xml#common/example_blurb"/> |
| |
| <p> |
| The following example uses the same <codeph>MAIL_MERGE</codeph> table as in the example for |
| <xref href="impala_analytic_functions.xml#first_value"/>. Because the default window when <codeph>ORDER |
| BY</codeph> is used is <codeph>BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW</codeph>, the query requires the |
| <codeph>UNBOUNDED FOLLOWING</codeph> to look ahead to subsequent rows and find the last value for each |
| country. |
| </p> |
| |
| <codeblock>select country, name, |
| last_value(greeting) over ( |
| partition by country order by name, greeting |
| rows between unbounded preceding and unbounded following |
| ) as greeting |
| from mail_merge |
| +---------+---------+--------------+ |
| | country | name | greeting | |
| +---------+---------+--------------+ |
| | Germany | Boris | Guten morgen | |
| | Germany | Michael | Guten morgen | |
| | Sweden | Bjorn | Tja | |
| | Sweden | Mats | Tja | |
| | USA | John | Hello | |
| | USA | Pete | Hello | |
| +---------+---------+--------------+ |
| </codeblock> |
| |
| <p conref="../shared/impala_common.xml#common/related_info"/> |
| |
| <p> |
| <xref href="impala_analytic_functions.xml#first_value"/> |
| </p> |
| |
| </conbody> |
| |
| </concept> |
| |
| <concept rev="2.0.0" id="lead"> |
| |
| <title>LEAD Function</title> |
| |
| <conbody> |
| |
| <p> |
| This function returns the value of an expression using column values from a following row. You specify an |
| integer offset, which designates a row position some number of rows after to the current row. Any column |
| references in the expression argument refer to column values from that later row. Typically, the table |
| contains a time sequence or numeric sequence column that clearly distinguishes the ordering of the rows. |
| </p> |
| |
| <p conref="../shared/impala_common.xml#common/syntax_blurb"/> |
| |
| <codeblock>LEAD (<varname>expr</varname> [, <varname>offset</varname>] [, <varname>default</varname>]) |
| OVER ([<varname>partition_by_clause</varname>] <varname>order_by_clause</varname>)</codeblock> |
| |
| <p> |
| The <codeph>ORDER BY</codeph> clause is required. The <codeph>PARTITION BY</codeph> clause is optional. The |
| window clause is not allowed. |
| </p> |
| |
| <p conref="../shared/impala_common.xml#common/usage_notes_blurb"/> |
| |
| <p> |
| Sometimes used an an alternative to doing a self-join. |
| </p> |
| |
| <p conref="../shared/impala_common.xml#common/added_in_20"/> |
| |
| <p conref="../shared/impala_common.xml#common/example_blurb"/> |
| |
| <p> |
| The following example uses the same stock data created in <xref href="#window_clause"/>. The query analyzes |
| the closing price for a stock symbol, and for each day evaluates if the closing price for the following day |
| is higher or lower. |
| </p> |
| |
| <codeblock>select stock_symbol, closing_date, closing_price, |
| case |
| (lead(closing_price,1) |
| over (partition by stock_symbol order by closing_date) |
| - closing_price) > 0 |
| when true then "higher" |
| when false then "flat or lower" |
| end as "trending" |
| from stock_ticker |
| order by closing_date; |
| +--------------+---------------------+---------------+---------------+ |
| | stock_symbol | closing_date | closing_price | trending | |
| +--------------+---------------------+---------------+---------------+ |
| | JDR | 2014-09-13 00:00:00 | 12.86 | higher | |
| | JDR | 2014-09-14 00:00:00 | 12.89 | higher | |
| | JDR | 2014-09-15 00:00:00 | 12.94 | flat or lower | |
| | JDR | 2014-09-16 00:00:00 | 12.55 | higher | |
| | JDR | 2014-09-17 00:00:00 | 14.03 | higher | |
| | JDR | 2014-09-18 00:00:00 | 14.75 | flat or lower | |
| | JDR | 2014-09-19 00:00:00 | 13.98 | NULL | |
| +--------------+---------------------+---------------+---------------+ |
| </codeblock> |
| |
| <p conref="../shared/impala_common.xml#common/related_info"/> |
| |
| <p> |
| This function is the converse of <xref href="impala_analytic_functions.xml#lag"/>. |
| </p> |
| |
| </conbody> |
| |
| </concept> |
| |
| <concept id="max_analytic"> |
| |
| <title>MAX Function - Analytic Context</title> |
| |
| <conbody> |
| |
| <p> |
| You can include an <codeph>OVER</codeph> clause with a call to this function to use it as an analytic |
| function. See <xref href="impala_max.xml#max"/> for details and examples. |
| </p> |
| |
| </conbody> |
| |
| </concept> |
| |
| <concept id="min_analytic"> |
| |
| <title>MIN Function - Analytic Context</title> |
| |
| <conbody> |
| |
| <p> |
| You can include an <codeph>OVER</codeph> clause with a call to this function to use it as an analytic |
| function. See <xref href="impala_min.xml#min"/> for details and examples. |
| </p> |
| |
| </conbody> |
| |
| </concept> |
| |
| <concept audience="hidden" rev="2.x.x" id="nth_value"> |
| |
| <title>NTH_VALUE Function</title> |
| |
| <conbody> |
| |
| <p conref="../shared/impala_common.xml#common/syntax_blurb"/> |
| |
| <p conref="../shared/impala_common.xml#common/usage_notes_blurb"/> |
| |
| <p conref="../shared/impala_common.xml#common/example_blurb"/> |
| |
| </conbody> |
| |
| </concept> |
| |
| <concept rev="2.3.0" id="ntile"> |
| |
| <title>NTILE Function (<keyword keyref="impala23"/> or higher only)</title> |
| |
| <conbody> |
| |
| <p> |
| Returns the <q>bucket number</q> associated with each row, between 1 and the value of an expression. For |
| example, creating 100 buckets puts the lowest 1% of values in the first bucket, while creating 10 buckets |
| puts the lowest 10% of values in the first bucket. Each partition can have a different number of buckets. |
| <!-- What's the syntax or data distribution that would create a different number of buckets per partition? --> |
| </p> |
| |
| <p conref="../shared/impala_common.xml#common/syntax_blurb"/> |
| |
| <codeblock>NTILE (<varname>expr</varname> [, <varname>offset</varname> ...] |
| OVER ([<varname>partition_by_clause</varname>] <varname>order_by_clause</varname>)</codeblock> |
| |
| <p> |
| The <codeph>ORDER BY</codeph> clause is required. The <codeph>PARTITION BY</codeph> clause is optional. The |
| window clause is not allowed. |
| </p> |
| |
| <p conref="../shared/impala_common.xml#common/usage_notes_blurb"/> |
| |
| <p> |
| The <q>ntile</q> name is derived from the practice of dividing result sets into fourths (quartile), tenths |
| (decile), and so on. The <codeph>NTILE()</codeph> function divides the result set based on an arbitrary |
| percentile value. |
| </p> |
| |
| <p> |
| The number of buckets must be a positive integer. |
| </p> |
| |
| <p> |
| The number of items in each bucket is identical or almost so, varying by at most 1. If the number of items |
| does not divide evenly between the buckets, the remaining N items are divided evenly among the first N |
| buckets. |
| </p> |
| |
| <p> |
| If the number of buckets N is greater than the number of input rows in the partition, then the first N |
| buckets each contain one item, and the remaining buckets are empty. |
| </p> |
| |
| <p conref="../shared/impala_common.xml#common/example_blurb"/> |
| |
| <p> |
| The following example shows divides groups of animals into 4 buckets based on their weight. The |
| <codeph>ORDER BY ... DESC</codeph> clause in the <codeph>OVER()</codeph> clause means that the heaviest 25% |
| are in the first group, and the lightest 25% are in the fourth group. (The <codeph>ORDER BY</codeph> in the |
| outermost part of the query shows how you can order the final result set independently from the order in |
| which the rows are evaluated by the <codeph>OVER()</codeph> clause.) Because there are 9 rows in the group, |
| divided into 4 buckets, the first bucket receives the extra item. |
| </p> |
| |
| <codeblock>create table animals (name string, kind string, kilos decimal(9,3)); |
| |
| insert into animals values |
| ('Elephant', 'Mammal', 4000), ('Giraffe', 'Mammal', 1200), ('Mouse', 'Mammal', 0.020), |
| ('Condor', 'Bird', 15), ('Horse', 'Mammal', 500), ('Owl', 'Bird', 2.5), |
| ('Ostrich', 'Bird', 145), ('Polar bear', 'Mammal', 700), ('Housecat', 'Mammal', 5); |
| |
| select name, ntile(4) over (order by kilos desc) as quarter |
| from animals |
| order by quarter desc; |
| +------------+---------+ |
| | name | quarter | |
| +------------+---------+ |
| | Owl | 4 | |
| | Mouse | 4 | |
| | Condor | 3 | |
| | Housecat | 3 | |
| | Horse | 2 | |
| | Ostrich | 2 | |
| | Elephant | 1 | |
| | Giraffe | 1 | |
| | Polar bear | 1 | |
| +------------+---------+ |
| </codeblock> |
| |
| <p> |
| The following examples show how the <codeph>PARTITION</codeph> clause works for the |
| <codeph>NTILE()</codeph> function. Here, we divide each kind of animal (mammal or bird) into 2 buckets, |
| the heavier half and the lighter half. |
| </p> |
| |
| <codeblock>select name, kind, ntile(2) over (partition by kind order by kilos desc) as half |
| from animals |
| order by kind; |
| +------------+--------+------+ |
| | name | kind | half | |
| +------------+--------+------+ |
| | Ostrich | Bird | 1 | |
| | Condor | Bird | 1 | |
| | Owl | Bird | 2 | |
| | Elephant | Mammal | 1 | |
| | Giraffe | Mammal | 1 | |
| | Polar bear | Mammal | 1 | |
| | Horse | Mammal | 2 | |
| | Housecat | Mammal | 2 | |
| | Mouse | Mammal | 2 | |
| +------------+--------+------+ |
| </codeblock> |
| |
| <p> |
| Again, the result set can be ordered independently |
| from the analytic evaluation. This next example lists all the animals heaviest to lightest, |
| showing that elephant and giraffe are in the <q>top half</q> of mammals by weight, while |
| housecat and mouse are in the <q>bottom half</q>. |
| </p> |
| |
| <codeblock>select name, kind, ntile(2) over (partition by kind order by kilos desc) as half |
| from animals |
| order by kilos desc; |
| +------------+--------+------+ |
| | name | kind | half | |
| +------------+--------+------+ |
| | Elephant | Mammal | 1 | |
| | Giraffe | Mammal | 1 | |
| | Polar bear | Mammal | 1 | |
| | Horse | Mammal | 2 | |
| | Ostrich | Bird | 1 | |
| | Condor | Bird | 1 | |
| | Housecat | Mammal | 2 | |
| | Owl | Bird | 2 | |
| | Mouse | Mammal | 2 | |
| +------------+--------+------+ |
| </codeblock> |
| |
| </conbody> |
| |
| </concept> |
| |
| <concept rev="2.3.0" id="percent_rank"> |
| |
| <title>PERCENT_RANK Function (<keyword keyref="impala23"/> or higher only)</title> |
| |
| <conbody> |
| |
| <p conref="../shared/impala_common.xml#common/syntax_blurb"/> |
| |
| <codeblock>PERCENT_RANK (<varname>expr</varname>) |
| OVER ([<varname>partition_by_clause</varname>] <varname>order_by_clause</varname>) |
| </codeblock> |
| |
| <p> |
| Calculates the rank, expressed as a percentage, of each row within a group of rows. |
| If <codeph>rank</codeph> is the value for that same row from the <codeph>RANK()</codeph> function (from 1 to the total number of rows in the partition group), |
| then the <codeph>PERCENT_RANK()</codeph> value is calculated as <codeph>(<varname>rank</varname> - 1) / (<varname>rows_in_group</varname> - 1)</codeph> . |
| If there is only a single item in the partition group, its <codeph>PERCENT_RANK()</codeph> value is 0. |
| </p> |
| |
| <p> |
| The <codeph>ORDER BY</codeph> clause is required. The <codeph>PARTITION BY</codeph> clause is optional. The |
| window clause is not allowed. |
| </p> |
| |
| <p conref="../shared/impala_common.xml#common/usage_notes_blurb"/> |
| |
| <p> |
| This function is similar to the <codeph>RANK</codeph> and <codeph>CUME_DIST()</codeph> functions: it returns an ascending sequence representing the position of each |
| row within the rows of the same partition group. The actual numeric sequence is calculated differently, |
| and the handling of duplicate (tied) values is different. |
| </p> |
| |
| <p> |
| The return values range from 0 to 1 inclusive. |
| The first row in each partition group always has the value 0. |
| A <codeph>NULL</codeph> value is considered the lowest possible value. |
| In the case of duplicate input values, all the corresponding rows in the result set |
| have an identical value: the lowest <codeph>PERCENT_RANK()</codeph> value of those |
| tied rows. (In contrast to <codeph>CUME_DIST()</codeph>, where all tied rows have |
| the highest <codeph>CUME_DIST()</codeph> value.) |
| </p> |
| |
| <p conref="../shared/impala_common.xml#common/example_blurb"/> |
| |
| <p> |
| The following example uses the same <codeph>ANIMALS</codeph> table as the examples for <codeph>CUME_DIST()</codeph> |
| and <codeph>NTILE()</codeph>, with a few additional rows to illustrate the results where some values are |
| <codeph>NULL</codeph> or there is only a single row in a partition group. |
| </p> |
| |
| <codeblock>insert into animals values ('Komodo dragon', 'Reptile', 70); |
| insert into animals values ('Unicorn', 'Mythical', NULL); |
| insert into animals values ('Fire-breathing dragon', 'Mythical', NULL); |
| </codeblock> |
| |
| <p> |
| As with <codeph>CUME_DIST()</codeph>, there is an ascending sequence for each kind of animal. |
| For example, the <q>Birds</q> and <q>Mammals</q> rows each have a <codeph>PERCENT_RANK()</codeph> sequence |
| that ranges from 0 to 1. |
| The <q>Reptile</q> row has a <codeph>PERCENT_RANK()</codeph> of 0 because that partition group contains only a single item. |
| Both <q>Mythical</q> animals have a <codeph>PERCENT_RANK()</codeph> of 0 because |
| a <codeph>NULL</codeph> is considered the lowest value within its partition group. |
| </p> |
| |
| <codeblock>select name, kind, percent_rank() over (partition by kind order by kilos) from animals; |
| +-----------------------+----------+--------------------------+ |
| | name | kind | percent_rank() OVER(...) | |
| +-----------------------+----------+--------------------------+ |
| | Mouse | Mammal | 0 | |
| | Housecat | Mammal | 0.2 | |
| | Horse | Mammal | 0.4 | |
| | Polar bear | Mammal | 0.6 | |
| | Giraffe | Mammal | 0.8 | |
| | Elephant | Mammal | 1 | |
| | Komodo dragon | Reptile | 0 | |
| | Owl | Bird | 0 | |
| | California Condor | Bird | 0.25 | |
| | Andean Condor | Bird | 0.25 | |
| | Condor | Bird | 0.25 | |
| | Ostrich | Bird | 1 | |
| | Fire-breathing dragon | Mythical | 0 | |
| | Unicorn | Mythical | 0 | |
| +-----------------------+----------+--------------------------+ |
| </codeblock> |
| </conbody> |
| |
| </concept> |
| |
| <concept rev="2.0.0" id="rank"> |
| |
| <title>RANK Function</title> |
| |
| <conbody> |
| |
| <p> |
| Returns an ascending sequence of integers, starting with 1. The output sequence produces duplicate integers |
| for duplicate values of the <codeph>ORDER BY</codeph> expressions. After generating duplicate output values |
| for the <q>tied</q> input values, the function increments the sequence by the number of tied values. |
| Therefore, the sequence contains both duplicates and gaps when the input contains duplicates. Starts the |
| sequence over for each group produced by the <codeph>PARTITIONED BY</codeph> clause. |
| </p> |
| |
| <p conref="../shared/impala_common.xml#common/syntax_blurb"/> |
| |
| <codeblock>RANK() OVER([<varname>partition_by_clause</varname>] <varname>order_by_clause</varname>)</codeblock> |
| |
| <p> |
| The <codeph>PARTITION BY</codeph> clause is optional. The <codeph>ORDER BY</codeph> clause is required. The |
| window clause is not allowed. |
| </p> |
| |
| <p conref="../shared/impala_common.xml#common/usage_notes_blurb"/> |
| |
| <!-- Make a little tutorial to show these 3 functions side-by-side and illustrate their difference. --> |
| |
| <p> |
| Often used for top-N and bottom-N queries. For example, it could produce a <q>top 10</q> report including |
| several items that were tied for 10th place. |
| </p> |
| |
| <p> |
| Similar to <codeph>ROW_NUMBER</codeph> and <codeph>DENSE_RANK</codeph>. These functions differ in how they |
| treat duplicate combinations of values. |
| </p> |
| |
| <p conref="../shared/impala_common.xml#common/added_in_20"/> |
| |
| <p conref="../shared/impala_common.xml#common/example_blurb"/> |
| |
| <p> |
| The following example demonstrates how the <codeph>RANK()</codeph> function identifies where each value |
| <q>places</q> in the result set, producing the same result for duplicate values, and skipping values in the |
| sequence to account for the number of duplicates. For example, when results are ordered by the |
| <codeph>X</codeph> column, both <codeph>1</codeph> values are tied for first; both <codeph>2</codeph> |
| values are tied for third; and so on. |
| </p> |
| |
| <codeblock>select x, rank() over(order by x) as rank, property from int_t; |
| +----+------+----------+ |
| | x | rank | property | |
| +----+------+----------+ |
| | 1 | 1 | square | |
| | 1 | 1 | odd | |
| | 2 | 3 | even | |
| | 2 | 3 | prime | |
| | 3 | 5 | prime | |
| | 3 | 5 | odd | |
| | 4 | 7 | even | |
| | 4 | 7 | square | |
| | 5 | 9 | odd | |
| | 5 | 9 | prime | |
| | 6 | 11 | even | |
| | 6 | 11 | perfect | |
| | 7 | 13 | lucky | |
| | 7 | 13 | lucky | |
| | 7 | 13 | lucky | |
| | 7 | 13 | odd | |
| | 7 | 13 | prime | |
| | 8 | 18 | even | |
| | 9 | 19 | square | |
| | 9 | 19 | odd | |
| | 10 | 21 | round | |
| | 10 | 21 | even | |
| +----+------+----------+ |
| </codeblock> |
| |
| <p> |
| The following examples show how the <codeph>RANK()</codeph> function is affected by the |
| <codeph>PARTITION</codeph> property within the <codeph>ORDER BY</codeph> clause. |
| </p> |
| |
| <p> |
| Partitioning by the <codeph>PROPERTY</codeph> column groups all the even, odd, and so on values together, |
| and <codeph>RANK()</codeph> returns the place of each value within the group, producing several ascending |
| sequences. |
| </p> |
| |
| <codeblock>select x, rank() over(partition by property order by x) as rank, property from int_t; |
| +----+------+----------+ |
| | x | rank | property | |
| +----+------+----------+ |
| | 2 | 1 | even | |
| | 4 | 2 | even | |
| | 6 | 3 | even | |
| | 8 | 4 | even | |
| | 10 | 5 | even | |
| | 7 | 1 | lucky | |
| | 7 | 1 | lucky | |
| | 7 | 1 | lucky | |
| | 1 | 1 | odd | |
| | 3 | 2 | odd | |
| | 5 | 3 | odd | |
| | 7 | 4 | odd | |
| | 9 | 5 | odd | |
| | 6 | 1 | perfect | |
| | 2 | 1 | prime | |
| | 3 | 2 | prime | |
| | 5 | 3 | prime | |
| | 7 | 4 | prime | |
| | 10 | 1 | round | |
| | 1 | 1 | square | |
| | 4 | 2 | square | |
| | 9 | 3 | square | |
| +----+------+----------+ |
| </codeblock> |
| |
| <p> |
| Partitioning by the <codeph>X</codeph> column groups all the duplicate numbers together and returns the |
| place each value within the group; because each value occurs only 1 or 2 times, |
| <codeph>RANK()</codeph> designates each <codeph>X</codeph> value as either first or second within its |
| group. |
| </p> |
| |
| <codeblock>select x, rank() over(partition by x order by property) as rank, property from int_t; |
| +----+------+----------+ |
| | x | rank | property | |
| +----+------+----------+ |
| | 1 | 1 | odd | |
| | 1 | 2 | square | |
| | 2 | 1 | even | |
| | 2 | 2 | prime | |
| | 3 | 1 | odd | |
| | 3 | 2 | prime | |
| | 4 | 1 | even | |
| | 4 | 2 | square | |
| | 5 | 1 | odd | |
| | 5 | 2 | prime | |
| | 6 | 1 | even | |
| | 6 | 2 | perfect | |
| | 7 | 1 | lucky | |
| | 7 | 1 | lucky | |
| | 7 | 1 | lucky | |
| | 7 | 4 | odd | |
| | 7 | 5 | prime | |
| | 8 | 1 | even | |
| | 9 | 1 | odd | |
| | 9 | 2 | square | |
| | 10 | 1 | even | |
| | 10 | 2 | round | |
| +----+------+----------+ |
| </codeblock> |
| |
| <p> |
| The following example shows how a magazine might prepare a list of history's wealthiest people. Croesus and |
| Midas are tied for second, then Crassus is fourth. |
| </p> |
| |
| <codeblock>select rank() over (order by net_worth desc) as rank, name, net_worth from wealth order by rank, name; |
| +------+---------+---------------+ |
| | rank | name | net_worth | |
| +------+---------+---------------+ |
| | 1 | Solomon | 2000000000.00 | |
| | 2 | Croesus | 1000000000.00 | |
| | 2 | Midas | 1000000000.00 | |
| | 4 | Crassus | 500000000.00 | |
| | 5 | Scrooge | 80000000.00 | |
| +------+---------+---------------+ |
| </codeblock> |
| |
| <p conref="../shared/impala_common.xml#common/related_info"/> |
| |
| <p> |
| <xref href="impala_analytic_functions.xml#dense_rank"/>, |
| <xref href="impala_analytic_functions.xml#row_number"/> |
| </p> |
| |
| </conbody> |
| |
| </concept> |
| |
| <concept rev="2.0.0" id="row_number"> |
| |
| <title>ROW_NUMBER Function</title> |
| |
| <conbody> |
| |
| <p> |
| Returns an ascending sequence of integers, starting with 1. Starts the sequence over for each group |
| produced by the <codeph>PARTITIONED BY</codeph> clause. The output sequence includes different values for |
| duplicate input values. Therefore, the sequence never contains any duplicates or gaps, regardless of |
| duplicate input values. |
| </p> |
| |
| <p conref="../shared/impala_common.xml#common/syntax_blurb"/> |
| |
| <codeblock>ROW_NUMBER() OVER([<varname>partition_by_clause</varname>] <varname>order_by_clause</varname>)</codeblock> |
| |
| <p> |
| The <codeph>ORDER BY</codeph> clause is required. The <codeph>PARTITION BY</codeph> clause is optional. The |
| window clause is not allowed. |
| </p> |
| |
| <p conref="../shared/impala_common.xml#common/usage_notes_blurb"/> |
| |
| <p> |
| Often used for top-N and bottom-N queries where the input values are known to be unique, or precisely N |
| rows are needed regardless of duplicate values. |
| </p> |
| |
| <p> |
| Because its result value is different for each row in the result set (when used without a <codeph>PARTITION |
| BY</codeph> clause), <codeph>ROW_NUMBER()</codeph> can be used to synthesize unique numeric ID values, for |
| example for result sets involving unique values or tuples. |
| </p> |
| |
| <p> |
| Similar to <codeph>RANK</codeph> and <codeph>DENSE_RANK</codeph>. These functions differ in how they treat |
| duplicate combinations of values. |
| </p> |
| |
| <p conref="../shared/impala_common.xml#common/added_in_20"/> |
| |
| <p conref="../shared/impala_common.xml#common/example_blurb"/> |
| |
| <p> |
| The following example demonstrates how <codeph>ROW_NUMBER()</codeph> produces a continuous numeric |
| sequence, even though some values of <codeph>X</codeph> are repeated. |
| </p> |
| |
| <codeblock>select x, row_number() over(order by x, property) as row_number, property from int_t; |
| +----+------------+----------+ |
| | x | row_number | property | |
| +----+------------+----------+ |
| | 1 | 1 | odd | |
| | 1 | 2 | square | |
| | 2 | 3 | even | |
| | 2 | 4 | prime | |
| | 3 | 5 | odd | |
| | 3 | 6 | prime | |
| | 4 | 7 | even | |
| | 4 | 8 | square | |
| | 5 | 9 | odd | |
| | 5 | 10 | prime | |
| | 6 | 11 | even | |
| | 6 | 12 | perfect | |
| | 7 | 13 | lucky | |
| | 7 | 14 | lucky | |
| | 7 | 15 | lucky | |
| | 7 | 16 | odd | |
| | 7 | 17 | prime | |
| | 8 | 18 | even | |
| | 9 | 19 | odd | |
| | 9 | 20 | square | |
| | 10 | 21 | even | |
| | 10 | 22 | round | |
| +----+------------+----------+ |
| </codeblock> |
| |
| <p> |
| The following example shows how a financial institution might assign customer IDs to some of history's |
| wealthiest figures. Although two of the people have identical net worth figures, unique IDs are required |
| for this purpose. <codeph>ROW_NUMBER()</codeph> produces a sequence of five different values for the five |
| input rows. |
| </p> |
| |
| <codeblock>select row_number() over (order by net_worth desc) as account_id, name, net_worth |
| from wealth order by account_id, name; |
| +------------+---------+---------------+ |
| | account_id | name | net_worth | |
| +------------+---------+---------------+ |
| | 1 | Solomon | 2000000000.00 | |
| | 2 | Croesus | 1000000000.00 | |
| | 3 | Midas | 1000000000.00 | |
| | 4 | Crassus | 500000000.00 | |
| | 5 | Scrooge | 80000000.00 | |
| +------------+---------+---------------+ |
| </codeblock> |
| |
| <p conref="../shared/impala_common.xml#common/related_info"/> |
| |
| <p> |
| <xref href="impala_analytic_functions.xml#rank"/>, <xref href="impala_analytic_functions.xml#dense_rank"/> |
| </p> |
| |
| </conbody> |
| |
| </concept> |
| |
| <concept id="sum_analytic"> |
| |
| <title>SUM Function - Analytic Context</title> |
| |
| <conbody> |
| |
| <p> |
| You can include an <codeph>OVER</codeph> clause with a call to this function to use it as an analytic |
| function. See <xref href="impala_sum.xml#sum"/> for details and examples. |
| </p> |
| |
| </conbody> |
| |
| </concept> |
| |
| </concept> |