| .. Licensed to the Apache Software Foundation (ASF) under one |
| .. or more contributor license agreements. See the NOTICE file |
| .. distributed with this work for additional information |
| .. regarding copyright ownership. The ASF licenses this file |
| .. to you under the Apache License, Version 2.0 (the |
| .. "License"); you may not use this file except in compliance |
| .. with the License. You may obtain a copy of the License at |
| |
| .. http://www.apache.org/licenses/LICENSE-2.0 |
| |
| .. Unless required by applicable law or agreed to in writing, |
| .. software distributed under the License is distributed on an |
| .. "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY |
| .. KIND, either express or implied. See the License for the |
| .. specific language governing permissions and limitations |
| .. under the License. |
| |
| .. _pandas_interop: |
| |
| Pandas Integration |
| ================== |
| |
| To interface with `pandas <https://pandas.pydata.org/>`_, PyArrow provides |
| various conversion routines to consume pandas structures and convert back |
| to them. |
| |
| .. note:: |
| While pandas uses NumPy as a backend, it has enough peculiarities |
| (such as a different type system, and support for null values) that this |
| is a separate topic from :ref:`numpy_interop`. |
| |
| To follow examples in this document, make sure to run: |
| |
| .. code-block:: python |
| |
| >>> import pandas as pd |
| >>> import pyarrow as pa |
| |
| DataFrames |
| ---------- |
| |
| The equivalent to a pandas DataFrame in Arrow is a :ref:`Table <data.table>`. |
Both consist of a set of named columns of equal length. While pandas only
supports flat columns, the Table also provides nested columns, so it can
represent data that a DataFrame cannot; a full conversion is therefore not
always possible.
| |
| Conversion from a Table to a DataFrame is done by calling |
| :meth:`pyarrow.Table.to_pandas`. The inverse is then achieved by using |
| :meth:`pyarrow.Table.from_pandas`. |
| |
| .. code-block:: python |
| |
| >>> df = pd.DataFrame({"a": [1, 2, 3]}) |
| >>> # Convert from pandas to Arrow |
| >>> table = pa.Table.from_pandas(df) |
| >>> # Convert back to pandas |
| >>> df_new = table.to_pandas() |
| |
| >>> # Infer Arrow schema from pandas |
| >>> schema = pa.Schema.from_pandas(df) |
| |
| By default ``pyarrow`` tries to preserve and restore the ``.index`` |
| data as accurately as possible. See the section below for more about |
| this, and how to disable this logic. |
| |
| Series |
| ------ |
| |
In Arrow, the most similar structure to a pandas Series is an Array.
It is a vector that contains data of a single type, stored in contiguous
(linear) memory. You can convert a pandas Series to an Arrow Array using
:meth:`pyarrow.Array.from_pandas`. As Arrow Arrays are always nullable,
you can supply an optional mask using the ``mask`` parameter to mark all
null entries.
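
For illustration, here is a small sketch of using the ``mask`` parameter
(``True`` in the mask marks the corresponding entry as null; the exact
``repr`` output shown may vary slightly between versions):

.. code-block:: python

    >>> import numpy as np
    >>> s = pd.Series([1, 2, 3])
    >>> # entries where the mask is True become null in the Arrow Array
    >>> arr = pa.Array.from_pandas(s, mask=np.array([False, True, False]))
    >>> arr
    <pyarrow.lib.Int64Array object at ...>
    [
      1,
      null,
      3
    ]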
| |
| Handling pandas Indexes |
| ----------------------- |
| |
| Methods like :meth:`pyarrow.Table.from_pandas` have a |
| ``preserve_index`` option which defines how to preserve (store) or not |
| to preserve (to not store) the data in the ``index`` member of the |
| corresponding pandas object. This data is tracked using schema-level |
| metadata in the internal ``arrow::Schema`` object. |
| |
| The default of ``preserve_index`` is ``None``, which behaves as |
| follows: |
| |
| * ``RangeIndex`` is stored as metadata-only, not requiring any extra |
| storage. |
* Other index types are stored as one or more physical data columns in
  the resulting :class:`Table`.
| |
To not store the index at all, pass ``preserve_index=False``. Since
storing a ``RangeIndex`` as metadata only can cause issues in some limited
scenarios (such as storing multiple DataFrame objects in a Parquet file),
you can force all index data to be serialized in the resulting table by
passing ``preserve_index=True``.
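
For illustration, the effect of the three modes can be seen in the resulting
column names (``__index_level_0__`` is the name used for an unnamed index
that is serialized as a column; output shown for a recent pyarrow version):

.. code-block:: python

    >>> df = pd.DataFrame({"a": [1, 2, 3]}, index=["x", "y", "z"])
    >>> # default (None): a non-RangeIndex is stored as a physical column
    >>> pa.Table.from_pandas(df).column_names
    ['a', '__index_level_0__']
    >>> # preserve_index=False: the index is not stored at all
    >>> pa.Table.from_pandas(df, preserve_index=False).column_names
    ['a']
    >>> # preserve_index=True: even a RangeIndex is serialized as a column
    >>> df_range = pd.DataFrame({"a": [1, 2, 3]})
    >>> pa.Table.from_pandas(df_range, preserve_index=True).column_names
    ['a', '__index_level_0__']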
| |
| Type differences |
| ---------------- |
| |
With the current design of pandas and Arrow, it is not possible to convert all
column types unmodified. One of the main issues here is that pandas has no
support for nullable columns of arbitrary type. Also, ``datetime64`` is currently
fixed to nanosecond resolution. On the other hand, Arrow may still be missing
support for some types.
| |
| pandas -> Arrow Conversion |
| ~~~~~~~~~~~~~~~~~~~~~~~~~~ |
| |
| +------------------------+--------------------------+ |
| | Source Type (pandas) | Destination Type (Arrow) | |
| +========================+==========================+ |
| | ``bool`` | ``BOOL`` | |
| +------------------------+--------------------------+ |
| | ``(u)int{8,16,32,64}`` | ``(U)INT{8,16,32,64}`` | |
| +------------------------+--------------------------+ |
| | ``float32`` | ``FLOAT`` | |
| +------------------------+--------------------------+ |
| | ``float64`` | ``DOUBLE`` | |
| +------------------------+--------------------------+ |
| | ``str`` / ``unicode`` | ``STRING`` | |
| +------------------------+--------------------------+ |
| | ``pd.Categorical`` | ``DICTIONARY`` | |
| +------------------------+--------------------------+ |
| | ``pd.Timestamp`` | ``TIMESTAMP(unit=ns)`` | |
| +------------------------+--------------------------+ |
| | ``datetime.date`` | ``DATE`` | |
| +------------------------+--------------------------+ |
| | ``datetime.time`` | ``TIME64`` | |
| +------------------------+--------------------------+ |
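
As a quick sketch of these mappings, the inferred Arrow types can be inspected
through the schema (``preserve_index=False`` simply keeps the index out of the
schema; the type strings shown are illustrative):

.. code-block:: python

    >>> df = pd.DataFrame({"ints": [1, 2], "floats": [1.0, 2.0], "strs": ["a", "b"]})
    >>> schema = pa.Schema.from_pandas(df, preserve_index=False)
    >>> [(field.name, str(field.type)) for field in schema]
    [('ints', 'int64'), ('floats', 'double'), ('strs', 'string')]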
| |
| Arrow -> pandas Conversion |
| ~~~~~~~~~~~~~~~~~~~~~~~~~~ |
| |
| +-------------------------------------+--------------------------------------------------------+ |
| | Source Type (Arrow) | Destination Type (pandas) | |
| +=====================================+========================================================+ |
| | ``BOOL`` | ``bool`` | |
| +-------------------------------------+--------------------------------------------------------+ |
| | ``BOOL`` *with nulls* | ``object`` (with values ``True``, ``False``, ``None``) | |
| +-------------------------------------+--------------------------------------------------------+ |
| | ``(U)INT{8,16,32,64}`` | ``(u)int{8,16,32,64}`` | |
| +-------------------------------------+--------------------------------------------------------+ |
| | ``(U)INT{8,16,32,64}`` *with nulls* | ``float64`` | |
| +-------------------------------------+--------------------------------------------------------+ |
| | ``FLOAT`` | ``float32`` | |
| +-------------------------------------+--------------------------------------------------------+ |
| | ``DOUBLE`` | ``float64`` | |
| +-------------------------------------+--------------------------------------------------------+ |
| | ``STRING`` | ``str`` | |
| +-------------------------------------+--------------------------------------------------------+ |
| | ``DICTIONARY`` | ``pd.Categorical`` | |
| +-------------------------------------+--------------------------------------------------------+ |
| | ``TIMESTAMP(unit=*)`` | ``pd.Timestamp`` (``np.datetime64[ns]``) | |
| +-------------------------------------+--------------------------------------------------------+ |
| | ``DATE`` | ``object`` (with ``datetime.date`` objects) | |
| +-------------------------------------+--------------------------------------------------------+ |
| | ``TIME64`` | ``object`` (with ``datetime.time`` objects) | |
| +-------------------------------------+--------------------------------------------------------+ |
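
As a small illustration of this table, a boolean array with nulls comes back
as an ``object`` column, while an integer array with nulls comes back as
``float64`` (output shown for a recent pyarrow version):

.. code-block:: python

    >>> pa.array([True, None, False]).to_pandas()
    0     True
    1     None
    2    False
    dtype: object
    >>> pa.array([1, 2, None]).to_pandas().dtype
    dtype('float64')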
| |
| Categorical types |
| ~~~~~~~~~~~~~~~~~ |
| |
| `Pandas categorical <https://pandas.pydata.org/pandas-docs/stable/user_guide/categorical.html>`_ |
| columns are converted to :ref:`Arrow dictionary arrays <data.dictionary>`, |
a special array type that is optimized for repeated values drawn from a
limited set of possible values.
| |
| .. code-block:: python |
| |
| >>> df = pd.DataFrame({"cat": pd.Categorical(["a", "b", "c", "a", "b", "c"])}) |
| >>> df.cat.dtype.categories |
| Index(['a', 'b', 'c'], dtype='object') |
| >>> df |
| cat |
| 0 a |
| 1 b |
| 2 c |
| 3 a |
| 4 b |
| 5 c |
| >>> table = pa.Table.from_pandas(df) |
| >>> table |
| pyarrow.Table |
| cat: dictionary<values=string, indices=int8, ordered=0> |
| ---- |
| cat: [ -- dictionary: |
| ["a","b","c"] -- indices: |
| [0,1,2,0,1,2]] |
| |
We can inspect the :class:`~.ChunkedArray` of the created table and see the
same categories as in the pandas DataFrame.
| |
| .. code-block:: python |
| |
| >>> column = table[0] |
| >>> chunk = column.chunk(0) |
| >>> chunk.dictionary |
| <pyarrow.lib.StringArray object at ...> |
| [ |
| "a", |
| "b", |
| "c" |
| ] |
| >>> chunk.indices |
| <pyarrow.lib.Int8Array object at ...> |
| [ |
| 0, |
| 1, |
| 2, |
| 0, |
| 1, |
| 2 |
| ] |
| |
| Datetime (Timestamp) types |
| ~~~~~~~~~~~~~~~~~~~~~~~~~~ |
| |
| `Pandas Timestamps <https://pandas.pydata.org/pandas-docs/stable/user_guide/timeseries.html>`_ |
| use the ``datetime64[ns]`` type in Pandas and are converted to an Arrow |
| :class:`~.TimestampArray`. |
| |
| .. code-block:: python |
| |
| >>> df = pd.DataFrame({"datetime": pd.date_range("2020-01-01T00:00:00Z", freq="h", periods=3)}) |
| >>> df.dtypes |
| datetime datetime64[ns, UTC] |
| dtype: object |
| >>> df |
| datetime |
| 0 2020-01-01 00:00:00+00:00 |
| 1 2020-01-01 01:00:00+00:00 |
| 2 2020-01-01 02:00:00+00:00 |
| >>> table = pa.Table.from_pandas(df) |
| >>> table |
| pyarrow.Table |
| datetime: timestamp[ns, tz=UTC] |
| ---- |
| datetime: [[2020-01-01 00:00:00.000000000Z,...,2020-01-01 02:00:00.000000000Z]] |
| |
In this example the pandas Timestamp is time zone aware
(``UTC`` in this case), and this information is used to create the Arrow
:class:`~.TimestampArray`.
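
Converting back to pandas preserves the time zone information; continuing the
example above (output shown for illustration):

.. code-block:: python

    >>> table.to_pandas()["datetime"].dtype
    datetime64[ns, UTC]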
| |
| Date types |
| ~~~~~~~~~~ |
| |
| While dates can be handled using the ``datetime64[ns]`` type in |
| pandas, some systems work with object arrays of Python's built-in |
``datetime.date`` objects:
| |
| .. code-block:: python |
| |
| >>> from datetime import date |
| >>> s = pd.Series([date(2018, 12, 31), None, date(2000, 1, 1)]) |
| >>> s |
| 0 2018-12-31 |
| 1 None |
| 2 2000-01-01 |
| dtype: object |
| |
| When converting to an Arrow array, the ``date32`` type will be used by |
| default: |
| |
| .. code-block:: python |
| |
| >>> arr = pa.array(s) |
| >>> arr.type |
| DataType(date32[day]) |
| >>> arr[0] |
| <pyarrow.Date32Scalar: datetime.date(2018, 12, 31)> |
| |
| To use the 64-bit ``date64``, specify this explicitly: |
| |
| .. code-block:: python |
| |
| >>> arr = pa.array(s, type='date64') |
| >>> arr.type |
| DataType(date64[ms]) |
| |
| When converting back with ``to_pandas``, object arrays of |
| ``datetime.date`` objects are returned: |
| |
| .. code-block:: python |
| |
| >>> arr.to_pandas() |
| 0 2018-12-31 |
| 1 None |
| 2 2000-01-01 |
| dtype: object |
| |
| If you want to use NumPy's ``datetime64`` dtype instead, pass |
| ``date_as_object=False``: |
| |
| .. code-block:: python |
| |
| >>> s2 = pd.Series(arr.to_pandas(date_as_object=False)) |
| >>> s2.dtype |
| dtype('<M8[ms]') |
| |
| .. warning:: |
| |
    As of Arrow ``0.13`` the parameter ``date_as_object`` is ``True``
    by default. Older versions must pass ``date_as_object=True`` to
    obtain this behavior.
| |
| Time types |
| ~~~~~~~~~~ |
| |
The built-in ``datetime.time`` objects inside pandas data structures will be
converted to an Arrow ``time64`` type and a corresponding :class:`~.Time64Array`.
| |
| .. code-block:: python |
| |
| >>> from datetime import time |
| >>> s = pd.Series([time(1, 1, 1), time(2, 2, 2)]) |
| >>> s |
| 0 01:01:01 |
| 1 02:02:02 |
| dtype: object |
| >>> arr = pa.array(s) |
| >>> arr.type |
| Time64Type(time64[us]) |
| >>> arr |
| <pyarrow.lib.Time64Array object at ...> |
| [ |
| 01:01:01.000000, |
| 02:02:02.000000 |
| ] |
| |
| When converting to pandas, arrays of ``datetime.time`` objects are returned: |
| |
| .. code-block:: python |
| |
| >>> arr.to_pandas() |
| 0 01:01:01 |
| 1 02:02:02 |
| dtype: object |
| |
| Nullable types |
| -------------- |
| |
| In Arrow all data types are nullable, meaning they support storing missing |
| values. In pandas, however, not all data types have support for missing data. |
Most notably, the default integer data types do not, and will be cast
to float when missing values are introduced. Therefore, when an Arrow array
| or table gets converted to pandas, integer columns will become float when |
| missing values are present: |
| |
| .. code-block:: python |
| |
| >>> arr = pa.array([1, 2, None]) |
| >>> arr |
| <pyarrow.lib.Int64Array object at ...> |
| [ |
| 1, |
| 2, |
| null |
| ] |
| >>> arr.to_pandas() |
| 0 1.0 |
| 1 2.0 |
| 2 NaN |
| dtype: float64 |
| |
Pandas has experimental `nullable data types <https://pandas.pydata.org/docs/user_guide/integer_na.html>`_.
Arrow supports round-trip conversion for those:
| |
| .. code-block:: python |
| |
| >>> df = pd.DataFrame({'a': pd.Series([1, 2, None], dtype="Int64")}) |
| >>> df |
| a |
| 0 1 |
| 1 2 |
| 2 <NA> |
| |
| >>> table = pa.table(df) |
| >>> table |
| pyarrow.Table |
| a: int64 |
| ---- |
| a: [[1,2,null]] |
| |
| >>> table.to_pandas() |
| a |
| 0 1 |
| 1 2 |
| 2 <NA> |
| |
| >>> table.to_pandas().dtypes |
| a Int64 |
| dtype: object |
| |
| This roundtrip conversion works because metadata about the original pandas |
| DataFrame gets stored in the Arrow table. However, if you have Arrow data (or |
| e.g. a Parquet file) not originating from a pandas DataFrame with nullable |
| data types, the default conversion to pandas will not use those nullable |
| dtypes. |
| |
| The :meth:`pyarrow.Table.to_pandas` method has a ``types_mapper`` keyword |
| that can be used to override the default data type used for the resulting |
| pandas DataFrame. This way, you can instruct Arrow to create a pandas |
| DataFrame using nullable dtypes. |
| |
| .. code-block:: python |
| |
| >>> table = pa.table({"a": [1, 2, None]}) |
| >>> table.to_pandas() |
| a |
| 0 1.0 |
| 1 2.0 |
| 2 NaN |
| >>> table.to_pandas(types_mapper={pa.int64(): pd.Int64Dtype()}.get) |
| a |
| 0 1 |
| 1 2 |
| 2 <NA> |
| |
| The ``types_mapper`` keyword expects a function that will return the pandas |
| data type to use given a pyarrow data type. By using the ``dict.get`` method, |
| we can create such a function using a dictionary. |
| |
If you want to use all nullable dtypes currently supported by pandas, this
dictionary becomes:
| |
| .. code-block:: python |
| |
| >>> dtype_mapping = { |
| ... pa.int8(): pd.Int8Dtype(), |
| ... pa.int16(): pd.Int16Dtype(), |
| ... pa.int32(): pd.Int32Dtype(), |
| ... pa.int64(): pd.Int64Dtype(), |
| ... pa.uint8(): pd.UInt8Dtype(), |
| ... pa.uint16(): pd.UInt16Dtype(), |
| ... pa.uint32(): pd.UInt32Dtype(), |
| ... pa.uint64(): pd.UInt64Dtype(), |
| ... pa.bool_(): pd.BooleanDtype(), |
| ... pa.float32(): pd.Float32Dtype(), |
| ... pa.float64(): pd.Float64Dtype(), |
| ... pa.string(): pd.StringDtype(), |
| ... } |
| >>> df = table.to_pandas(types_mapper=dtype_mapping.get) |
| |
| |
| When using the pandas API for reading Parquet files (``pd.read_parquet(..)``), |
| this can also be achieved by passing ``use_nullable_dtypes``: |
| |
| .. code-block:: python |
| |
| >>> df = pd.read_parquet(path, use_nullable_dtypes=True) # doctest: +SKIP |
| |
| |
| Memory Usage and Zero Copy |
| -------------------------- |
| |
| When converting from Arrow data structures to pandas objects using various |
| ``to_pandas`` methods, one must occasionally be mindful of issues related to |
| performance and memory usage. |
| |
| Since pandas's internal data representation is generally different from the |
| Arrow columnar format, zero copy conversions (where no memory allocation or |
| computation is required) are only possible in certain limited cases. |
| |
In the worst case scenario, calling ``to_pandas`` will result in two versions
of the data in memory, one for Arrow and one for pandas, yielding approximately
twice the memory footprint. We have implemented some mitigations for this case,
particularly when creating large ``DataFrame`` objects, which we describe below.
| |
| Zero Copy Series Conversions |
| ~~~~~~~~~~~~~~~~~~~~~~~~~~~~ |
| |
| Zero copy conversions from ``Array`` or ``ChunkedArray`` to NumPy arrays or |
| pandas Series are possible in certain narrow cases: |
| |
| * The Arrow data is stored in an integer (signed or unsigned ``int8`` through |
| ``int64``) or floating point type (``float16`` through ``float64``). This |
| includes many numeric types as well as timestamps. |
| * The Arrow data has no null values (since these are represented using bitmaps |
| which are not supported by pandas). |
| * For ``ChunkedArray``, the data consists of a single chunk, |
| i.e. ``arr.num_chunks == 1``. Multiple chunks will always require a copy |
| because of pandas's contiguousness requirement. |
| |
| In these scenarios, ``to_pandas`` or ``to_numpy`` will be zero copy. In all |
| other scenarios, a copy will be required. |
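
For illustration, requesting a strictly zero copy conversion succeeds for a
single-chunk, null-free numeric array (output shown for a recent pyarrow
version):

.. code-block:: python

    >>> arr = pa.array([1.0, 2.0, 3.0])
    >>> # zero_copy_only=True raises an error if a copy would be required
    >>> arr.to_numpy(zero_copy_only=True)
    array([1., 2., 3.])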
| |
| Reducing Memory Use in ``Table.to_pandas`` |
| ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ |
| |
| As of this writing, pandas applies a data management strategy called |
| "consolidation" to collect like-typed DataFrame columns in two-dimensional |
| NumPy arrays, referred to internally as "blocks". We have gone to great effort |
| to construct the precise "consolidated" blocks so that pandas will not perform |
| any further allocation or copies after we hand off the data to |
| ``pandas.DataFrame``. The obvious downside of this consolidation strategy is |
| that it forces a "memory doubling". |
| |
| To try to limit the potential effects of "memory doubling" during |
| ``Table.to_pandas``, we provide a couple of options: |
| |
* ``split_blocks=True``, when enabled, ``Table.to_pandas`` produces one internal
| DataFrame "block" for each column, skipping the "consolidation" step. Note |
| that many pandas operations will trigger consolidation anyway, but the peak |
| memory use may be less than the worst case scenario of a full memory |
| doubling. As a result of this option, we are able to do zero copy conversions |
| of columns in the same cases where we can do zero copy with ``Array`` and |
| ``ChunkedArray``. |
* ``self_destruct=True``, this destroys the internal Arrow memory buffers in
  each column of the ``Table`` object as they are converted to the pandas-compatible
| representation, potentially releasing memory to the operating system as soon |
| as a column is converted. Note that this renders the calling ``Table`` object |
| unsafe for further use, and any further methods called will cause your Python |
| process to crash. |
| |
| Used together, the call |
| |
| .. code-block:: python |
| |
| >>> df = table.to_pandas(split_blocks=True, self_destruct=True) |
| >>> del table # not necessary, but a good practice |
| |
| will yield significantly lower memory usage in some scenarios. Without these |
| options, ``to_pandas`` will always double memory. |
| |
| Note that ``self_destruct=True`` is not guaranteed to save memory. Since the |
| conversion happens column by column, memory is also freed column by column. But |
| if multiple columns share an underlying buffer, then no memory will be freed |
| until all of those columns are converted. In particular, due to implementation |
| details, data that comes from IPC or Flight is prone to this, as memory will be |
| laid out as follows:: |
| |
| Record Batch 0: Allocation 0: array 0 chunk 0, array 1 chunk 0, ... |
| Record Batch 1: Allocation 1: array 0 chunk 1, array 1 chunk 1, ... |
| ... |
| |
| In this case, no memory can be freed until the entire table is converted, even |
| with ``self_destruct=True``. |