blob: 82266854f5834c64f0f1e88f3c2a0c8aeefe8932 [file] [log] [blame]
.. Licensed to the Apache Software Foundation (ASF) under one
.. or more contributor license agreements. See the NOTICE file
.. distributed with this work for additional information
.. regarding copyright ownership. The ASF licenses this file
.. to you under the Apache License, Version 2.0 (the
.. "License"); you may not use this file except in compliance
.. with the License. You may obtain a copy of the License at
.. http://www.apache.org/licenses/LICENSE-2.0
.. Unless required by applicable law or agreed to in writing,
.. software distributed under the License is distributed on an
.. "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
.. KIND, either express or implied. See the License for the
.. specific language governing permissions and limitations
.. under the License.
.. _window_functions:
Window Functions
================
In this section you will learn about window functions. A window function utilizes values from one or
multiple rows to produce a result for each individual row, unlike an aggregate function that
provides a single value for multiple rows.
The window functions are available in the :py:mod:`~datafusion.functions` module.
We'll use the pokemon dataset (from Ritchie Vink) in the following examples.
.. ipython:: python
from datafusion import SessionContext
from datafusion import col
from datafusion import functions as f
ctx = SessionContext()
df = ctx.read_csv("pokemon.csv")
Here is an example that shows how you can compare each pokemon's speed to the speed of the
previous row in the DataFrame.
.. ipython:: python
df.select(
col('"Name"'),
col('"Speed"'),
f.lag(col('"Speed"')).alias("Previous Speed")
)
Setting Parameters
------------------
Ordering
^^^^^^^^
You can control the order in which rows are processed by window functions by providing
a list of ``order_by`` functions for the ``order_by`` parameter.
.. ipython:: python
df.select(
col('"Name"'),
col('"Attack"'),
col('"Type 1"'),
f.rank(
partition_by=[col('"Type 1"')],
order_by=[col('"Attack"').sort(ascending=True)],
).alias("rank"),
).sort(col('"Type 1"'), col('"Attack"'))
Partitions
^^^^^^^^^^
A window function can take a list of ``partition_by`` columns similar to an
:ref:`Aggregation Function<aggregation>`. This will cause the window values to be evaluated
independently for each of the partitions. In the example above, we found the rank of each
Pokemon per ``Type 1`` partitions. We can see the first couple of each partition if we do
the following:
.. ipython:: python
df.select(
col('"Name"'),
col('"Attack"'),
col('"Type 1"'),
f.rank(
partition_by=[col('"Type 1"')],
order_by=[col('"Attack"').sort(ascending=True)],
).alias("rank"),
).filter(col("rank") < lit(3)).sort(col('"Type 1"'), col("rank"))
Window Frame
^^^^^^^^^^^^
When using aggregate functions, the Window Frame of defines the rows over which it operates.
If you do not specify a Window Frame, the frame will be set depending on the following
criteria.
* If an ``order_by`` clause is set, the default window frame is defined as the rows between
unbounded preceding and the current row.
* If an ``order_by`` is not set, the default frame is defined as the rows between unbounded
and unbounded following (the entire partition).
Window Frames are defined by three parameters: unit type, starting bound, and ending bound.
The unit types available are:
* Rows: The starting and ending boundaries are defined by the number of rows relative to the
current row.
* Range: When using Range, the ``order_by`` clause must have exactly one term. The boundaries
are defined bow how close the rows are to the value of the expression in the ``order_by``
parameter.
* Groups: A "group" is the set of all rows that have equivalent values for all terms in the
``order_by`` clause.
In this example we perform a "rolling average" of the speed of the current Pokemon and the
two preceding rows.
.. ipython:: python
from datafusion.expr import WindowFrame
df.select(
col('"Name"'),
col('"Speed"'),
f.window("avg",
[col('"Speed"')],
order_by=[col('"Speed"')],
window_frame=WindowFrame("rows", 2, 0)
).alias("Previous Speed")
)
Null Treatment
^^^^^^^^^^^^^^
When using aggregate functions as window functions, it is often useful to specify how null values
should be treated. In order to do this you need to use the builder function. In future releases
we expect this to be simplified in the interface.
One common usage for handling nulls is the case where you want to find the last value up to the
current row. In the following example we demonstrate how setting the null treatment to ignore
nulls will fill in with the value of the most recent non-null row. To do this, we also will set
the window frame so that we only process up to the current row.
In this example, we filter down to one specific type of Pokemon that does have some entries in
it's ``Type 2`` column that are null.
.. ipython:: python
from datafusion.common import NullTreatment
df.filter(col('"Type 1"') == lit("Bug")).select(
'"Name"',
'"Type 2"',
f.window("last_value", [col('"Type 2"')])
.window_frame(WindowFrame("rows", None, 0))
.order_by(col('"Speed"'))
.null_treatment(NullTreatment.IGNORE_NULLS)
.build()
.alias("last_wo_null"),
f.window("last_value", [col('"Type 2"')])
.window_frame(WindowFrame("rows", None, 0))
.order_by(col('"Speed"'))
.null_treatment(NullTreatment.RESPECT_NULLS)
.build()
.alias("last_with_null")
)
Aggregate Functions
-------------------
You can use any :ref:`Aggregation Function<aggregation>` as a window function. Currently
aggregate functions must use the deprecated
:py:func:`datafusion.functions.window` API but this should be resolved in
DataFusion 42.0 (`Issue Link <https://github.com/apache/datafusion-python/issues/833>`_). Here
is an example that shows how to compare each pokemons’s attack power with the average attack
power in its ``"Type 1"`` using the :py:func:`datafusion.functions.avg` function.
.. ipython:: python
:okwarning:
df.select(
col('"Name"'),
col('"Attack"'),
col('"Type 1"'),
f.window("avg", [col('"Attack"')])
.partition_by(col('"Type 1"'))
.build()
.alias("Average Attack"),
)
Available Functions
-------------------
The possible window functions are:
1. Rank Functions
- :py:func:`datafusion.functions.rank`
- :py:func:`datafusion.functions.dense_rank`
- :py:func:`datafusion.functions.ntile`
- :py:func:`datafusion.functions.row_number`
2. Analytical Functions
- :py:func:`datafusion.functions.cume_dist`
- :py:func:`datafusion.functions.percent_rank`
- :py:func:`datafusion.functions.lag`
- :py:func:`datafusion.functions.lead`
3. Aggregate Functions
- All :ref:`Aggregation Functions<aggregation>` can be used as window functions.