.. Licensed to the Apache Software Foundation (ASF) under one
   or more contributor license agreements. See the NOTICE file
   distributed with this work for additional information
   regarding copyright ownership. The ASF licenses this file
   to you under the Apache License, Version 2.0 (the
   "License"); you may not use this file except in compliance
   with the License. You may obtain a copy of the License at

..   http://www.apache.org/licenses/LICENSE-2.0

.. Unless required by applicable law or agreed to in writing,
   software distributed under the License is distributed on an
   "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
   KIND, either express or implied. See the License for the
   specific language governing permissions and limitations
   under the License.

==============================
Transform and apply a function
==============================

.. NOTE: the images are stored at https://github.com/databricks/koalas/issues/1443. Feel free to edit and/or add.

.. currentmodule:: pyspark.pandas

There are many APIs that allow users to apply a function against a pandas-on-Spark DataFrame, such as
:func:`DataFrame.transform`, :func:`DataFrame.apply`, :func:`DataFrame.pandas_on_spark.transform_batch`,
:func:`DataFrame.pandas_on_spark.apply_batch`, :func:`Series.pandas_on_spark.transform_batch`, etc. Each has a distinct
purpose and works differently internally. This section describes the differences among
them that often confuse users.
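
The examples below assume pandas API on Spark is imported under its conventional alias:

.. code-block:: python

    >>> import pyspark.pandas as ps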

``transform`` and ``apply``
---------------------------

The main difference between :func:`DataFrame.transform` and :func:`DataFrame.apply` is that the former requires
the function to return output of the same length as its input, whereas the latter does not. See the example below:

.. code-block:: python

    >>> psdf = ps.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6]})
    >>> def pandas_plus(pser):
    ...     return pser + 1  # should always return the same length as input.
    ...
    >>> psdf.transform(pandas_plus)

.. code-block:: python

    >>> psdf = ps.DataFrame({'a': [1, 2, 3], 'b': [5, 6, 7]})
    >>> def pandas_plus(pser):
    ...     return pser[pser % 2 == 1]  # allows an arbitrary length
    ...
    >>> psdf.apply(pandas_plus)

In this case, each function takes a pandas Series, and pandas API on Spark computes the function in a distributed manner as below.

.. image:: ../../../../../docs/img/pyspark-pandas_on_spark-transform_apply1.png
    :alt: transform and apply
    :align: center
    :width: 550

When ``axis='columns'`` is specified, the function takes each row as a pandas Series.

.. code-block:: python

    >>> psdf = ps.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6]})
    >>> def pandas_plus(pser):
    ...     return sum(pser)  # returns a scalar for each row
    ...
    >>> psdf.apply(pandas_plus, axis='columns')

The example above computes the sum of each row, which the function receives as a pandas Series. See below:

.. image:: ../../../../../docs/img/pyspark-pandas_on_spark-transform_apply2.png
    :alt: apply axis
    :align: center
    :width: 600

In the examples above, type hints were omitted for simplicity, but it is encouraged to use them
to avoid the performance penalty of schema inference. Please refer to the API documentation.
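
For example, giving the function a return type hint allows pandas API on Spark to skip sampling
the data to infer the output schema. A minimal sketch with :func:`DataFrame.transform`, using the
same function and data as above:

.. code-block:: python

    >>> import numpy as np
    >>> psdf = ps.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6]})
    >>> def pandas_plus(pser) -> ps.Series[np.int64]:
    ...     return pser + 1  # the return type hint avoids sampling for schema inference
    ...
    >>> psdf.transform(pandas_plus)
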
``pandas_on_spark.transform_batch`` and ``pandas_on_spark.apply_batch``
-----------------------------------------------------------------------

In :func:`DataFrame.pandas_on_spark.transform_batch`, :func:`DataFrame.pandas_on_spark.apply_batch`, :func:`Series.pandas_on_spark.transform_batch`, etc., the ``batch``
suffix means each chunk of the pandas-on-Spark DataFrame or Series. The APIs slice the pandas-on-Spark DataFrame or Series,
and then apply the given function with a pandas DataFrame or Series as input and output. See the examples below:

.. code-block:: python

    >>> psdf = ps.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6]})
    >>> def pandas_plus(pdf):
    ...     return pdf + 1  # should always return the same length as input.
    ...
    >>> psdf.pandas_on_spark.transform_batch(pandas_plus)

.. code-block:: python

    >>> psdf = ps.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6]})
    >>> def pandas_plus(pdf):
    ...     return pdf[pdf.a > 1]  # allows an arbitrary length
    ...
    >>> psdf.pandas_on_spark.apply_batch(pandas_plus)

The functions in both examples take a pandas DataFrame as a chunk of the pandas-on-Spark DataFrame, and output a pandas DataFrame.
Pandas API on Spark then combines the pandas DataFrames into a pandas-on-Spark DataFrame.

Note that :func:`DataFrame.pandas_on_spark.transform_batch` has the length restriction (the lengths of the input and
the output must be the same), whereas :func:`DataFrame.pandas_on_spark.apply_batch` does not. However, it is important
to know that when :func:`DataFrame.pandas_on_spark.transform_batch` returns a Series, the output is regarded as
belonging to the same DataFrame, so you can avoid the shuffle required by operations between different DataFrames.
In the case of :func:`DataFrame.pandas_on_spark.apply_batch`, its output is always treated as belonging to a new,
different DataFrame. See also `Operations on different DataFrames <options.rst#operations-on-different-dataframes>`_ for more details.
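
For example, because a Series returned by :func:`DataFrame.pandas_on_spark.transform_batch` still belongs to the
original DataFrame, it can be assigned back as a new column without a shuffle. A minimal sketch; the column name
``c`` is chosen arbitrarily here:

.. code-block:: python

    >>> psdf = ps.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6]})
    >>> def pandas_plus(pdf):
    ...     return pdf.b + 1  # takes a pandas DataFrame, returns a pandas Series
    ...
    >>> # The output is treated as part of the same DataFrame, so no shuffle
    >>> # (and no 'compute.ops_on_diff_frames' option) is needed.
    >>> psdf['c'] = psdf.pandas_on_spark.transform_batch(pandas_plus)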

.. image:: ../../../../../docs/img/pyspark-pandas_on_spark-transform_apply3.png
    :alt: pandas_on_spark.transform_batch and pandas_on_spark.apply_batch in Frame
    :align: center
    :width: 650

:func:`Series.pandas_on_spark.transform_batch` is similar to :func:`DataFrame.pandas_on_spark.transform_batch`;
however, it takes a pandas Series as a chunk of the pandas-on-Spark Series.

.. code-block:: python

    >>> psdf = ps.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6]})
    >>> def pandas_plus(pser):
    ...     return pser + 1  # should always return the same length as input.
    ...
    >>> psdf.a.pandas_on_spark.transform_batch(pandas_plus)

Under the hood, the pandas-on-Spark Series is split into multiple pandas Series (each a batch), and the function computes on each of them as below:

.. image:: ../../../../../docs/img/pyspark-pandas_on_spark-transform_apply4.png
    :alt: pandas_on_spark.transform_batch in Series
    :width: 350
    :align: center

There are more details, such as type inference and how to prevent its performance penalty. Please refer to the API documentation.
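
For instance, a return type hint lets the batch APIs skip the schema inference step as well. A minimal
sketch with :func:`DataFrame.pandas_on_spark.transform_batch`; note that when only the types are given
in the hint, the output columns receive default names:

.. code-block:: python

    >>> import numpy as np
    >>> psdf = ps.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6]})
    >>> def pandas_plus(pdf) -> ps.DataFrame[np.int64, np.int64]:
    ...     return pdf + 1  # the return type hint avoids sampling for schema inference
    ...
    >>> psdf.pandas_on_spark.transform_batch(pandas_plus)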