docs/source/user-guide/common-operations/aggregations.rst - datafusion-python - Git at Google

 .. Licensed to the Apache Software Foundation (ASF) under one
 .. or more contributor license agreements.  See the NOTICE file
 .. distributed with this work for additional information
 .. regarding copyright ownership.  The ASF licenses this file
 .. to you under the Apache License, Version 2.0 (the
 .. "License"); you may not use this file except in compliance
 .. with the License.  You may obtain a copy of the License at

 ..   http://www.apache.org/licenses/LICENSE-2.0

 .. Unless required by applicable law or agreed to in writing,
 .. software distributed under the License is distributed on an
 .. "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
 .. KIND, either express or implied.  See the License for the
 .. specific language governing permissions and limitations
 .. under the License.

 .. _aggregation:

 Aggregation
 ============

 An aggregate or aggregation is a function where the values of multiple rows are processed together
 to form a single summary value. For performing an aggregation, DataFusion provides the
 :py:func:`~datafusion.dataframe.DataFrame.aggregate`

 .. ipython:: python

     from datafusion import SessionContext, col, lit, functions as f

     ctx = SessionContext()
     df = ctx.read_csv("pokemon.csv")

     col_type_1 = col('"Type 1"')
     col_type_2 = col('"Type 2"')
     col_speed = col('"Speed"')
     col_attack = col('"Attack"')

     df.aggregate([col_type_1], [
         f.approx_distinct(col_speed).alias("Count"),
         f.approx_median(col_speed).alias("Median Speed"),
         f.approx_percentile_cont(col_speed, 0.9).alias("90% Speed")])

 When the :code:`group_by` list is empty the aggregation is done over the whole :class:`.DataFrame`.
 For grouping the :code:`group_by` list must contain at least one column.

 .. ipython:: python

     df.aggregate([col_type_1], [
         f.max(col_speed).alias("Max Speed"),
         f.avg(col_speed).alias("Avg Speed"),
         f.min(col_speed).alias("Min Speed")])

 More than one column can be used for grouping

 .. ipython:: python

     df.aggregate([col_type_1, col_type_2], [
         f.max(col_speed).alias("Max Speed"),
         f.avg(col_speed).alias("Avg Speed"),
         f.min(col_speed).alias("Min Speed")])


 Setting Parameters
 ------------------

 Each of the built in aggregate functions provides arguments for the parameters that affect their
 operation. These can also be overridden using the builder approach to setting any of the following
 parameters. When you use the builder, you must call ``build()`` to finish. For example, these two
 expressions are equivalent.

 .. ipython:: python

     first_1 = f.first_value(col("a"), order_by=[col("a")])
     first_2 = f.first_value(col("a")).order_by(col("a")).build()

 Ordering
 ^^^^^^^^

 You can control the order in which rows are processed by window functions by providing
 a list of ``order_by`` functions for the ``order_by`` parameter. In the following example, we
 sort the Pokemon by their attack in increasing order and take the first value, which gives us the
 Pokemon with the smallest attack value in each ``Type 1``.

 .. ipython:: python

     df.aggregate(
         [col('"Type 1"')],
         [f.first_value(
             col('"Name"'),
             order_by=[col('"Attack"').sort(ascending=True)]
             ).alias("Smallest Attack")
         ])

 Distinct
 ^^^^^^^^

 When you set the parameter ``distinct`` to ``True``, then unique values will only be evaluated one
 time each. Suppose we want to create an array of all of the ``Type 2`` for each ``Type 1`` of our
 Pokemon set. Since there will be many entries of ``Type 2`` we only one each distinct value.

 .. ipython:: python

     df.aggregate([col_type_1], [f.array_agg(col_type_2, distinct=True).alias("Type 2 List")])

 In the output of the above we can see that there are some ``Type 1`` for which the ``Type 2`` entry
 is ``null``. In reality, we probably want to filter those out. We can do this in two ways. First,
 we can filter DataFrame rows that have no ``Type 2``. If we do this, we might have some ``Type 1``
 entries entirely removed. The second is we can use the ``filter`` argument described below.

 .. ipython:: python

     df.filter(col_type_2.is_not_null()).aggregate([col_type_1], [f.array_agg(col_type_2, distinct=True).alias("Type 2 List")])

     df.aggregate([col_type_1], [f.array_agg(col_type_2, distinct=True, filter=col_type_2.is_not_null()).alias("Type 2 List")])

 Which approach you take should depend on your use case.

 Null Treatment
 ^^^^^^^^^^^^^^

 This option allows you to either respect or ignore null values.

 One common usage for handling nulls is the case where you want to find the first value within a
 partition. By setting the null treatment to ignore nulls, we can find the first non-null value
 in our partition.


 .. ipython:: python

     from datafusion.common import NullTreatment

     df.aggregate([col_type_1], [
         f.first_value(
             col_type_2,
             order_by=[col_attack],
             null_treatment=NullTreatment.RESPECT_NULLS
         ).alias("Lowest Attack Type 2")])

     df.aggregate([col_type_1], [
         f.first_value(
             col_type_2,
             order_by=[col_attack],
             null_treatment=NullTreatment.IGNORE_NULLS
         ).alias("Lowest Attack Type 2")])

 Filter
 ^^^^^^

 Using the filter option is useful for filtering results to include in the aggregate function. It can
 be seen in the example above on how this can be useful to only filter rows evaluated by the
 aggregate function without filtering rows from the entire DataFrame.

 Filter takes a single expression.

 Suppose we want to find the speed values for only Pokemon that have low Attack values.

 .. ipython:: python

     df.aggregate([col_type_1], [
         f.avg(col_speed).alias("Avg Speed All"),
         f.avg(col_speed, filter=col_attack < lit(50)).alias("Avg Speed Low Attack")])


 Aggregate Functions
 -------------------

 The available aggregate functions are:

 1. Comparison Functions
     - :py:func:`datafusion.functions.min`
     - :py:func:`datafusion.functions.max`
 2. Math Functions
     - :py:func:`datafusion.functions.sum`
     - :py:func:`datafusion.functions.avg`
     - :py:func:`datafusion.functions.median`
 3. Array Functions
     - :py:func:`datafusion.functions.array_agg`
 4. Logical Functions
     - :py:func:`datafusion.functions.bit_and`
     - :py:func:`datafusion.functions.bit_or`
     - :py:func:`datafusion.functions.bit_xor`
     - :py:func:`datafusion.functions.bool_and`
     - :py:func:`datafusion.functions.bool_or`
 5. Statistical Functions
     - :py:func:`datafusion.functions.count`
     - :py:func:`datafusion.functions.corr`
     - :py:func:`datafusion.functions.covar_samp`
     - :py:func:`datafusion.functions.covar_pop`
     - :py:func:`datafusion.functions.stddev`
     - :py:func:`datafusion.functions.stddev_pop`
     - :py:func:`datafusion.functions.var_samp`
     - :py:func:`datafusion.functions.var_pop`
 6. Linear Regression Functions
     - :py:func:`datafusion.functions.regr_count`
     - :py:func:`datafusion.functions.regr_slope`
     - :py:func:`datafusion.functions.regr_intercept`
     - :py:func:`datafusion.functions.regr_r2`
     - :py:func:`datafusion.functions.regr_avgx`
     - :py:func:`datafusion.functions.regr_avgy`
     - :py:func:`datafusion.functions.regr_sxx`
     - :py:func:`datafusion.functions.regr_syy`
     - :py:func:`datafusion.functions.regr_slope`
 7. Positional Functions
     - :py:func:`datafusion.functions.first_value`
     - :py:func:`datafusion.functions.last_value`
     - :py:func:`datafusion.functions.nth_value`
 8. String Functions
     - :py:func:`datafusion.functions.string_agg`
 9. Approximation Functions
     - :py:func:`datafusion.functions.approx_distinct`
     - :py:func:`datafusion.functions.approx_median`
     - :py:func:`datafusion.functions.approx_percentile_cont`
     - :py:func:`datafusion.functions.approx_percentile_cont_with_weight`
	.. Licensed to the Apache Software Foundation (ASF) under one
	.. or more contributor license agreements. See the NOTICE file
	.. distributed with this work for additional information
	.. regarding copyright ownership. The ASF licenses this file
	.. to you under the Apache License, Version 2.0 (the
	.. "License"); you may not use this file except in compliance
	.. with the License. You may obtain a copy of the License at

	.. http://www.apache.org/licenses/LICENSE-2.0

	.. Unless required by applicable law or agreed to in writing,
	.. software distributed under the License is distributed on an
	.. "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
	.. KIND, either express or implied. See the License for the
	.. specific language governing permissions and limitations
	.. under the License.

	.. _aggregation:

	Aggregation
	============

	An aggregate or aggregation is a function where the values of multiple rows are processed together
	to form a single summary value. For performing an aggregation, DataFusion provides the
	:py:func:`~datafusion.dataframe.DataFrame.aggregate`

	.. ipython:: python

	from datafusion import SessionContext, col, lit, functions as f

	ctx = SessionContext()
	df = ctx.read_csv("pokemon.csv")

	col_type_1 = col('"Type 1"')
	col_type_2 = col('"Type 2"')
	col_speed = col('"Speed"')
	col_attack = col('"Attack"')

	df.aggregate([col_type_1], [
	f.approx_distinct(col_speed).alias("Count"),
	f.approx_median(col_speed).alias("Median Speed"),
	f.approx_percentile_cont(col_speed, 0.9).alias("90% Speed")])

	When the :code:`group_by` list is empty the aggregation is done over the whole :class:`.DataFrame`.
	For grouping the :code:`group_by` list must contain at least one column.

	.. ipython:: python

	df.aggregate([col_type_1], [
	f.max(col_speed).alias("Max Speed"),
	f.avg(col_speed).alias("Avg Speed"),
	f.min(col_speed).alias("Min Speed")])

	More than one column can be used for grouping

	.. ipython:: python

	df.aggregate([col_type_1, col_type_2], [
	f.max(col_speed).alias("Max Speed"),
	f.avg(col_speed).alias("Avg Speed"),
	f.min(col_speed).alias("Min Speed")])



	Setting Parameters
	------------------

	Each of the built in aggregate functions provides arguments for the parameters that affect their
	operation. These can also be overridden using the builder approach to setting any of the following
	parameters. When you use the builder, you must call ``build()`` to finish. For example, these two
	expressions are equivalent.

	.. ipython:: python

	first_1 = f.first_value(col("a"), order_by=[col("a")])
	first_2 = f.first_value(col("a")).order_by(col("a")).build()

	Ordering
	^^^^^^^^

	You can control the order in which rows are processed by window functions by providing
	a list of ``order_by`` functions for the ``order_by`` parameter. In the following example, we
	sort the Pokemon by their attack in increasing order and take the first value, which gives us the
	Pokemon with the smallest attack value in each ``Type 1``.

	.. ipython:: python

	df.aggregate(
	[col('"Type 1"')],
	[f.first_value(
	col('"Name"'),
	order_by=[col('"Attack"').sort(ascending=True)]
	).alias("Smallest Attack")
	])

	Distinct
	^^^^^^^^

	When you set the parameter ``distinct`` to ``True``, then unique values will only be evaluated one
	time each. Suppose we want to create an array of all of the ``Type 2`` for each ``Type 1`` of our
	Pokemon set. Since there will be many entries of ``Type 2`` we only one each distinct value.

	.. ipython:: python

	df.aggregate([col_type_1], [f.array_agg(col_type_2, distinct=True).alias("Type 2 List")])

	In the output of the above we can see that there are some ``Type 1`` for which the ``Type 2`` entry
	is ``null``. In reality, we probably want to filter those out. We can do this in two ways. First,
	we can filter DataFrame rows that have no ``Type 2``. If we do this, we might have some ``Type 1``
	entries entirely removed. The second is we can use the ``filter`` argument described below.

	.. ipython:: python

	df.filter(col_type_2.is_not_null()).aggregate([col_type_1], [f.array_agg(col_type_2, distinct=True).alias("Type 2 List")])

	df.aggregate([col_type_1], [f.array_agg(col_type_2, distinct=True, filter=col_type_2.is_not_null()).alias("Type 2 List")])

	Which approach you take should depend on your use case.

	Null Treatment
	^^^^^^^^^^^^^^

	This option allows you to either respect or ignore null values.

	One common usage for handling nulls is the case where you want to find the first value within a
	partition. By setting the null treatment to ignore nulls, we can find the first non-null value
	in our partition.


	.. ipython:: python

	from datafusion.common import NullTreatment

	df.aggregate([col_type_1], [
	f.first_value(
	col_type_2,
	order_by=[col_attack],
	null_treatment=NullTreatment.RESPECT_NULLS
	).alias("Lowest Attack Type 2")])

	df.aggregate([col_type_1], [
	f.first_value(
	col_type_2,
	order_by=[col_attack],
	null_treatment=NullTreatment.IGNORE_NULLS
	).alias("Lowest Attack Type 2")])

	Filter
	^^^^^^

	Using the filter option is useful for filtering results to include in the aggregate function. It can
	be seen in the example above on how this can be useful to only filter rows evaluated by the
	aggregate function without filtering rows from the entire DataFrame.

	Filter takes a single expression.

	Suppose we want to find the speed values for only Pokemon that have low Attack values.

	.. ipython:: python

	df.aggregate([col_type_1], [
	f.avg(col_speed).alias("Avg Speed All"),
	f.avg(col_speed, filter=col_attack < lit(50)).alias("Avg Speed Low Attack")])


	Aggregate Functions
	-------------------

	The available aggregate functions are:

	1. Comparison Functions
	- :py:func:`datafusion.functions.min`
	- :py:func:`datafusion.functions.max`
	2. Math Functions
	- :py:func:`datafusion.functions.sum`
	- :py:func:`datafusion.functions.avg`
	- :py:func:`datafusion.functions.median`
	3. Array Functions
	- :py:func:`datafusion.functions.array_agg`
	4. Logical Functions
	- :py:func:`datafusion.functions.bit_and`
	- :py:func:`datafusion.functions.bit_or`
	- :py:func:`datafusion.functions.bit_xor`
	- :py:func:`datafusion.functions.bool_and`
	- :py:func:`datafusion.functions.bool_or`
	5. Statistical Functions
	- :py:func:`datafusion.functions.count`
	- :py:func:`datafusion.functions.corr`
	- :py:func:`datafusion.functions.covar_samp`
	- :py:func:`datafusion.functions.covar_pop`
	- :py:func:`datafusion.functions.stddev`
	- :py:func:`datafusion.functions.stddev_pop`
	- :py:func:`datafusion.functions.var_samp`
	- :py:func:`datafusion.functions.var_pop`
	6. Linear Regression Functions
	- :py:func:`datafusion.functions.regr_count`
	- :py:func:`datafusion.functions.regr_slope`
	- :py:func:`datafusion.functions.regr_intercept`
	- :py:func:`datafusion.functions.regr_r2`
	- :py:func:`datafusion.functions.regr_avgx`
	- :py:func:`datafusion.functions.regr_avgy`
	- :py:func:`datafusion.functions.regr_sxx`
	- :py:func:`datafusion.functions.regr_syy`
	- :py:func:`datafusion.functions.regr_slope`
	7. Positional Functions
	- :py:func:`datafusion.functions.first_value`
	- :py:func:`datafusion.functions.last_value`
	- :py:func:`datafusion.functions.nth_value`
	8. String Functions
	- :py:func:`datafusion.functions.string_agg`
	9. Approximation Functions
	- :py:func:`datafusion.functions.approx_distinct`
	- :py:func:`datafusion.functions.approx_median`
	- :py:func:`datafusion.functions.approx_percentile_cont`
	- :py:func:`datafusion.functions.approx_percentile_cont_with_weight`