docs/source/user-guide/common-operations/windows.rst - datafusion-python - Git at Google

 .. Licensed to the Apache Software Foundation (ASF) under one
 .. or more contributor license agreements.  See the NOTICE file
 .. distributed with this work for additional information
 .. regarding copyright ownership.  The ASF licenses this file
 .. to you under the Apache License, Version 2.0 (the
 .. "License"); you may not use this file except in compliance
 .. with the License.  You may obtain a copy of the License at

 ..   http://www.apache.org/licenses/LICENSE-2.0

 .. Unless required by applicable law or agreed to in writing,
 .. software distributed under the License is distributed on an
 .. "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
 .. KIND, either express or implied.  See the License for the
 .. specific language governing permissions and limitations
 .. under the License.

 .. _window_functions:

 Window Functions
 ================

 In this section you will learn about window functions. A window function utilizes values from one or
 multiple rows to produce a result for each individual row, unlike an aggregate function that
 provides a single value for multiple rows.

 The window functions are available in the :py:mod:`~datafusion.functions` module.

 We'll use the pokemon dataset (from Ritchie Vink) in the following examples.

 .. ipython:: python

     from datafusion import SessionContext
     from datafusion import col
     from datafusion import functions as f

     ctx = SessionContext()
     df = ctx.read_csv("pokemon.csv")

 Here is an example that shows how you can compare each pokemon's speed to the speed of the
 previous row in the DataFrame.

 .. ipython:: python

     df.select(
         col('"Name"'),
         col('"Speed"'),
         f.lag(col('"Speed"')).alias("Previous Speed")
     )

 Setting Parameters
 ------------------


 Ordering
 ^^^^^^^^

 You can control the order in which rows are processed by window functions by providing
 a list of ``order_by`` functions for the ``order_by`` parameter.

 .. ipython:: python

     df.select(
         col('"Name"'),
         col('"Attack"'),
         col('"Type 1"'),
         f.rank(
             partition_by=[col('"Type 1"')],
             order_by=[col('"Attack"').sort(ascending=True)],
         ).alias("rank"),
     ).sort(col('"Type 1"'), col('"Attack"'))

 Partitions
 ^^^^^^^^^^

 A window function can take a list of ``partition_by`` columns similar to an
 :ref:`Aggregation Function<aggregation>`. This will cause the window values to be evaluated
 independently for each of the partitions. In the example above, we found the rank of each
 Pokemon per ``Type 1`` partitions. We can see the first couple of each partition if we do
 the following:

 .. ipython:: python

     df.select(
         col('"Name"'),
         col('"Attack"'),
         col('"Type 1"'),
         f.rank(
             partition_by=[col('"Type 1"')],
             order_by=[col('"Attack"').sort(ascending=True)],
         ).alias("rank"),
     ).filter(col("rank") < lit(3)).sort(col('"Type 1"'), col("rank"))

 Window Frame
 ^^^^^^^^^^^^

 When using aggregate functions, the Window Frame of defines the rows over which it operates.
 If you do not specify a Window Frame, the frame will be set depending on the following
 criteria.

 * If an ``order_by`` clause is set, the default window frame is defined as the rows between
   unbounded preceding and the current row.
 * If an ``order_by`` is not set, the default frame is defined as the rows between unbounded
   and unbounded following (the entire partition).

 Window Frames are defined by three parameters: unit type, starting bound, and ending bound.

 The unit types available are:

 * Rows: The starting and ending boundaries are defined by the number of rows relative to the
   current row.
 * Range: When using Range, the ``order_by`` clause must have exactly one term. The boundaries
   are defined bow how close the rows are to the value of the expression in the ``order_by``
   parameter.
 * Groups: A "group" is the set of all rows that have equivalent values for all terms in the
   ``order_by`` clause.

 In this example we perform a "rolling average" of the speed of the current Pokemon and the
 two preceding rows.

 .. ipython:: python

     from datafusion.expr import WindowFrame

     df.select(
         col('"Name"'),
         col('"Speed"'),
         f.window("avg",
             [col('"Speed"')],
             order_by=[col('"Speed"')],
             window_frame=WindowFrame("rows", 2, 0)
         ).alias("Previous Speed")
     )

 Null Treatment
 ^^^^^^^^^^^^^^

 When using aggregate functions as window functions, it is often useful to specify how null values
 should be treated. In order to do this you need to use the builder function. In future releases
 we expect this to be simplified in the interface.

 One common usage for handling nulls is the case where you want to find the last value up to the
 current row. In the following example we demonstrate how setting the null treatment to ignore
 nulls will fill in with the value of the most recent non-null row. To do this, we also will set
 the window frame so that we only process up to the current row.

 In this example, we filter down to one specific type of Pokemon that does have some entries in
 it's ``Type 2`` column that are null.

 .. ipython:: python

     from datafusion.common import NullTreatment

     df.filter(col('"Type 1"') ==  lit("Bug")).select(
         '"Name"',
         '"Type 2"',
         f.window("last_value", [col('"Type 2"')])
             .window_frame(WindowFrame("rows", None, 0))
             .order_by(col('"Speed"'))
             .null_treatment(NullTreatment.IGNORE_NULLS)
             .build()
             .alias("last_wo_null"),
         f.window("last_value", [col('"Type 2"')])
             .window_frame(WindowFrame("rows", None, 0))
             .order_by(col('"Speed"'))
             .null_treatment(NullTreatment.RESPECT_NULLS)
             .build()
             .alias("last_with_null")
     )

 Aggregate Functions
 -------------------

 You can use any :ref:`Aggregation Function<aggregation>` as a window function. Currently
 aggregate functions must use the deprecated
 :py:func:`datafusion.functions.window` API but this should be resolved in
 DataFusion 42.0 (`Issue Link <https://github.com/apache/datafusion-python/issues/833>`_). Here
 is an example that shows how to compare each pokemons’s attack power with the average attack
 power in its ``"Type 1"`` using the :py:func:`datafusion.functions.avg` function.

 .. ipython:: python
     :okwarning:

     df.select(
         col('"Name"'),
         col('"Attack"'),
         col('"Type 1"'),
         f.window("avg", [col('"Attack"')])
             .partition_by(col('"Type 1"'))
             .build()
             .alias("Average Attack"),
     )

 Available Functions
 -------------------

 The possible window functions are:

 1. Rank Functions
     - :py:func:`datafusion.functions.rank`
     - :py:func:`datafusion.functions.dense_rank`
     - :py:func:`datafusion.functions.ntile`
     - :py:func:`datafusion.functions.row_number`

 2. Analytical Functions
     - :py:func:`datafusion.functions.cume_dist`
     - :py:func:`datafusion.functions.percent_rank`
     - :py:func:`datafusion.functions.lag`
     - :py:func:`datafusion.functions.lead`

 3. Aggregate Functions
     - All :ref:`Aggregation Functions<aggregation>` can be used as window functions.
	.. Licensed to the Apache Software Foundation (ASF) under one
	.. or more contributor license agreements. See the NOTICE file
	.. distributed with this work for additional information
	.. regarding copyright ownership. The ASF licenses this file
	.. to you under the Apache License, Version 2.0 (the
	.. "License"); you may not use this file except in compliance
	.. with the License. You may obtain a copy of the License at

	.. http://www.apache.org/licenses/LICENSE-2.0

	.. Unless required by applicable law or agreed to in writing,
	.. software distributed under the License is distributed on an
	.. "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
	.. KIND, either express or implied. See the License for the
	.. specific language governing permissions and limitations
	.. under the License.

	.. _window_functions:

	Window Functions
	================

	In this section you will learn about window functions. A window function utilizes values from one or
	multiple rows to produce a result for each individual row, unlike an aggregate function that
	provides a single value for multiple rows.

	The window functions are available in the :py:mod:`~datafusion.functions` module.

	We'll use the pokemon dataset (from Ritchie Vink) in the following examples.

	.. ipython:: python

	from datafusion import SessionContext
	from datafusion import col
	from datafusion import functions as f

	ctx = SessionContext()
	df = ctx.read_csv("pokemon.csv")

	Here is an example that shows how you can compare each pokemon's speed to the speed of the
	previous row in the DataFrame.

	.. ipython:: python

	df.select(
	col('"Name"'),
	col('"Speed"'),
	f.lag(col('"Speed"')).alias("Previous Speed")
	)

	Setting Parameters
	------------------


	Ordering
	^^^^^^^^

	You can control the order in which rows are processed by window functions by providing
	a list of ``order_by`` functions for the ``order_by`` parameter.

	.. ipython:: python

	df.select(
	col('"Name"'),
	col('"Attack"'),
	col('"Type 1"'),
	f.rank(
	partition_by=[col('"Type 1"')],
	order_by=[col('"Attack"').sort(ascending=True)],
	).alias("rank"),
	).sort(col('"Type 1"'), col('"Attack"'))

	Partitions
	^^^^^^^^^^

	A window function can take a list of ``partition_by`` columns similar to an
	:ref:`Aggregation Function<aggregation>`. This will cause the window values to be evaluated
	independently for each of the partitions. In the example above, we found the rank of each
	Pokemon per ``Type 1`` partitions. We can see the first couple of each partition if we do
	the following:

	.. ipython:: python

	df.select(
	col('"Name"'),
	col('"Attack"'),
	col('"Type 1"'),
	f.rank(
	partition_by=[col('"Type 1"')],
	order_by=[col('"Attack"').sort(ascending=True)],
	).alias("rank"),
	).filter(col("rank") < lit(3)).sort(col('"Type 1"'), col("rank"))

	Window Frame
	^^^^^^^^^^^^

	When using aggregate functions, the Window Frame of defines the rows over which it operates.
	If you do not specify a Window Frame, the frame will be set depending on the following
	criteria.

	* If an ``order_by`` clause is set, the default window frame is defined as the rows between
	unbounded preceding and the current row.
	* If an ``order_by`` is not set, the default frame is defined as the rows between unbounded
	and unbounded following (the entire partition).

	Window Frames are defined by three parameters: unit type, starting bound, and ending bound.

	The unit types available are:

	* Rows: The starting and ending boundaries are defined by the number of rows relative to the
	current row.
	* Range: When using Range, the ``order_by`` clause must have exactly one term. The boundaries
	are defined bow how close the rows are to the value of the expression in the ``order_by``
	parameter.
	* Groups: A "group" is the set of all rows that have equivalent values for all terms in the
	``order_by`` clause.

	In this example we perform a "rolling average" of the speed of the current Pokemon and the
	two preceding rows.

	.. ipython:: python

	from datafusion.expr import WindowFrame

	df.select(
	col('"Name"'),
	col('"Speed"'),
	f.window("avg",
	[col('"Speed"')],
	order_by=[col('"Speed"')],
	window_frame=WindowFrame("rows", 2, 0)
	).alias("Previous Speed")
	)

	Null Treatment
	^^^^^^^^^^^^^^

	When using aggregate functions as window functions, it is often useful to specify how null values
	should be treated. In order to do this you need to use the builder function. In future releases
	we expect this to be simplified in the interface.

	One common usage for handling nulls is the case where you want to find the last value up to the
	current row. In the following example we demonstrate how setting the null treatment to ignore
	nulls will fill in with the value of the most recent non-null row. To do this, we also will set
	the window frame so that we only process up to the current row.

	In this example, we filter down to one specific type of Pokemon that does have some entries in
	it's ``Type 2`` column that are null.

	.. ipython:: python

	from datafusion.common import NullTreatment

	df.filter(col('"Type 1"') == lit("Bug")).select(
	'"Name"',
	'"Type 2"',
	f.window("last_value", [col('"Type 2"')])
	.window_frame(WindowFrame("rows", None, 0))
	.order_by(col('"Speed"'))
	.null_treatment(NullTreatment.IGNORE_NULLS)
	.build()
	.alias("last_wo_null"),
	f.window("last_value", [col('"Type 2"')])
	.window_frame(WindowFrame("rows", None, 0))
	.order_by(col('"Speed"'))
	.null_treatment(NullTreatment.RESPECT_NULLS)
	.build()
	.alias("last_with_null")
	)

	Aggregate Functions
	-------------------

	You can use any :ref:`Aggregation Function<aggregation>` as a window function. Currently
	aggregate functions must use the deprecated
	:py:func:`datafusion.functions.window` API but this should be resolved in
	DataFusion 42.0 (`Issue Link <https://github.com/apache/datafusion-python/issues/833>`_). Here
	is an example that shows how to compare each pokemons’s attack power with the average attack
	power in its ``"Type 1"`` using the :py:func:`datafusion.functions.avg` function.

	.. ipython:: python
	:okwarning:

	df.select(
	col('"Name"'),
	col('"Attack"'),
	col('"Type 1"'),
	f.window("avg", [col('"Attack"')])
	.partition_by(col('"Type 1"'))
	.build()
	.alias("Average Attack"),
	)

	Available Functions
	-------------------

	The possible window functions are:

	1. Rank Functions
	- :py:func:`datafusion.functions.rank`
	- :py:func:`datafusion.functions.dense_rank`
	- :py:func:`datafusion.functions.ntile`
	- :py:func:`datafusion.functions.row_number`

	2. Analytical Functions
	- :py:func:`datafusion.functions.cume_dist`
	- :py:func:`datafusion.functions.percent_rank`
	- :py:func:`datafusion.functions.lag`
	- :py:func:`datafusion.functions.lead`

	3. Aggregate Functions
	- All :ref:`Aggregation Functions<aggregation>` can be used as window functions.