 ======================
 Creating Arrow Objects
 ======================

 Recipes related to the creation of Arrays, Tables,
 Tensors and all other Arrow entities.

 .. contents::

 Creating Arrays
 ===============

 Arrow keeps data in contiguous arrays optimised for memory footprint
 and SIMD processing. In Python, a :class:`pyarrow.Array` can be built
 from Python lists (or sequence types in general),
 ``numpy`` arrays and ``pandas`` Series.

 .. testcode::

     import pyarrow as pa

     array = pa.array([1, 2, 3, 4, 5])

 .. testcode::

     print(array)

 .. testoutput::

     [
       1,
       2,
       3,
       4,
       5
     ]
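
 The element type is inferred from the values, but it can also be requested
 explicitly. The following is a minimal sketch, assuming the ``type``
 argument of :func:`pyarrow.array`; the names ``inferred`` and
 ``small_ints`` are only illustrative.

 .. code-block:: python

     import pyarrow as pa

     # Inferred type: Python integers become int64 by default
     inferred = pa.array([1, 2, 3, 4, 5])

     # Explicit type: request 8-bit integers to reduce memory use
     small_ints = pa.array([1, 2, 3, 4, 5], type=pa.int8())

     print(inferred.type)    # int64
     print(small_ints.type)  # int8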

 :func:`pyarrow.array` also accepts a ``mask`` argument to specify which
 values should be considered nulls; ``True`` marks a value as null:

 .. testcode::

     import numpy as np

     array = pa.array([1, 2, 3, 4, 5],
                      mask=np.array([True, False, True, False, True]))

     print(array)

 .. testoutput::

     [
       null,
       2,
       null,
       4,
       null
     ]
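
 When the input is a plain Python sequence, a mask is often unnecessary
 because ``None`` entries are already converted to nulls. A short sketch
 of that behaviour (the name ``with_nulls`` is only illustrative):

 .. code-block:: python

     import pyarrow as pa

     # None entries in a Python list become Arrow nulls
     with_nulls = pa.array([1, None, 3, None, 5])

     print(with_nulls.null_count)             # 2
     print(with_nulls.is_null().to_pylist())  # [False, True, False, True, False]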

 When building arrays from ``numpy`` or ``pandas``, Arrow uses
 optimized code paths that rely on the in-memory representation
 of the data in ``numpy`` and ``pandas``:

 .. testcode::

     import numpy as np
     import pandas as pd

     array_from_numpy = pa.array(np.arange(5))
     array_from_pandas = pa.array(pd.Series([1, 2, 3, 4, 5]))
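
 One detail worth knowing when converting from ``numpy`` is how missing
 values are handled. The sketch below assumes the ``from_pandas`` argument
 of :func:`pyarrow.array` and compares the two behaviours on the same
 input; the variable names are only illustrative.

 .. code-block:: python

     import numpy as np
     import pyarrow as pa

     values = np.array([1.0, np.nan, 3.0])

     # By default a NumPy NaN stays an ordinary floating point value
     as_nan = pa.array(values)

     # With pandas semantics, NaN is treated as a null instead
     as_null = pa.array(values, from_pandas=True)

     print(as_nan.null_count)   # 0
     print(as_null.null_count)  # 1

     # Primitive arrays without nulls can be handed back to NumPy
     roundtrip = pa.array(np.arange(5)).to_numpy()
     print(roundtrip.tolist())  # [0, 1, 2, 3, 4]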

 Creating Tables
 ===============

 Arrow supports tabular data in :class:`pyarrow.Table`. Each column
 is represented by a :class:`pyarrow.ChunkedArray`, and a table can be created
 by pairing multiple arrays with names for their columns:

 .. testcode::

     import pyarrow as pa

     table = pa.table([
         pa.array([1, 2, 3, 4, 5]),
         pa.array(["a", "b", "c", "d", "e"]),
         pa.array([1.0, 2.0, 3.0, 4.0, 5.0])
     ], names=["col1", "col2", "col3"])

     print(table)

 .. testoutput::

     pyarrow.Table
     col1: int64
     col2: string
     col3: double
     ----
     col1: [[1,2,3,4,5]]
     col2: [["a","b","c","d","e"]]
     col3: [[1,2,3,4,5]]
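
 To see the column representation described above, a column can be pulled
 back out of the table and inspected. This is a small sketch using a
 two-column table of the same shape; ``col1`` as a variable name is only
 illustrative.

 .. code-block:: python

     import pyarrow as pa

     table = pa.table([
         pa.array([1, 2, 3, 4, 5]),
         pa.array(["a", "b", "c", "d", "e"]),
     ], names=["col1", "col2"])

     # Each column of a Table is a ChunkedArray, here with a single chunk
     col1 = table.column("col1")
     print(type(col1).__name__)                # ChunkedArray
     print(col1.num_chunks)                    # 1

     # Basic shape and schema information
     print(table.num_rows, table.num_columns)  # 5 2
     print(table.schema.names)                 # ['col1', 'col2']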

 Create Table from Plain Types
 =============================

 Arrow allows fast, zero-copy creation of Arrow arrays
 from ``numpy`` arrays and ``pandas`` Series, but it's also
 possible to create Arrow Arrays and Tables from
 plain Python structures.

 The :func:`pyarrow.table` function allows creation of Tables
 from a variety of inputs, including plain Python objects:

 .. testcode::

     import pyarrow as pa

     table = pa.table({
         "col1": [1, 2, 3, 4, 5],
         "col2": ["a", "b", "c", "d", "e"]
     })

     print(table)

 .. testoutput::

     pyarrow.Table
     col1: int64
     col2: string
     ----
     col1: [[1,2,3,4,5]]
     col2: [["a","b","c","d","e"]]

 .. note::

     All values provided in the dictionary will be passed to
     :func:`pyarrow.array` for conversion to Arrow arrays,
     and will benefit from zero copy behaviour when possible.
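
 If the inferred types are not the ones you want, a schema can be supplied
 alongside the dictionary. The sketch below assumes the ``schema`` argument
 of :func:`pyarrow.table`; the name ``typed_table`` and the choice of
 ``int8`` are only illustrative.

 .. code-block:: python

     import pyarrow as pa

     # Declare the column types instead of relying on inference
     schema = pa.schema([
         ("col1", pa.int8()),
         ("col2", pa.string()),
     ])

     typed_table = pa.table({
         "col1": [1, 2, 3, 4, 5],
         "col2": ["a", "b", "c", "d", "e"]
     }, schema=schema)

     print(typed_table.schema.field("col1").type)  # int8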

 The :meth:`pyarrow.Table.from_pylist` method allows the creation
 of Tables from Python lists of row dicts. Types are inferred if a
 schema is not explicitly passed.

 .. testcode::

     import pyarrow as pa

     table = pa.Table.from_pylist([
         {"col1": 1, "col2": "a"},
         {"col1": 2, "col2": "b"},
         {"col1": 3, "col2": "c"},
         {"col1": 4, "col2": "d"},
         {"col1": 5, "col2": "e"}
     ])

     print(table)

 .. testoutput::

     pyarrow.Table
     col1: int64
     col2: string
     ----
     col1: [[1,2,3,4,5]]
     col2: [["a","b","c","d","e"]]
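
 As a sketch of the explicit-schema case, the rows below are converted with
 a schema passed through the ``schema`` argument. It also shows a row that
 lacks one of the keys, which ends up as a null in that column. The names
 ``rows`` and ``table_from_rows`` are only illustrative.

 .. code-block:: python

     import pyarrow as pa

     rows = [
         {"col1": 1, "col2": "a"},
         {"col1": 2},  # "col2" is missing in this row
     ]

     schema = pa.schema([
         ("col1", pa.int32()),
         ("col2", pa.string()),
     ])

     table_from_rows = pa.Table.from_pylist(rows, schema=schema)

     print(table_from_rows.schema.field("col1").type)    # int32
     print(table_from_rows.column("col2").to_pylist())   # ['a', None]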

 Creating Record Batches
 =======================

 Most I/O operations in Arrow ship data to its destination in batches.
 :class:`pyarrow.RecordBatch` is the way Arrow represents such a batch:
 a set of equal-length arrays that share a schema. A RecordBatch can be
 seen as a slice of a table.

 .. testcode::

     import pyarrow as pa

     batch = pa.RecordBatch.from_arrays([
         pa.array([1, 3, 5, 7, 9]),
         pa.array([2, 4, 6, 8, 10])
     ], names=["odd", "even"])
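
 A batch exposes much of the same inspection API as a table. The following
 sketch rebuilds the batch and looks at its shape, its schema and one of its
 columns; ``odd`` and ``first_two`` are only illustrative names.

 .. code-block:: python

     import pyarrow as pa

     batch = pa.RecordBatch.from_arrays([
         pa.array([1, 3, 5, 7, 9]),
         pa.array([2, 4, 6, 8, 10])
     ], names=["odd", "even"])

     print(batch.num_rows)      # 5
     print(batch.num_columns)   # 2
     print(batch.schema.names)  # ['odd', 'even']

     # A RecordBatch column is a plain Array rather than a ChunkedArray
     odd = batch.column(0)
     print(isinstance(odd, pa.Array))  # True

     # Slicing returns a new batch over a range of rows
     first_two = batch.slice(0, 2)
     print(first_two.num_rows)  # 2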

 Multiple batches that share the same schema can be combined into
 a table using :meth:`pyarrow.Table.from_batches`:

 .. testcode::

     second_batch = pa.RecordBatch.from_arrays([
         pa.array([11, 13, 15, 17, 19]),
         pa.array([12, 14, 16, 18, 20])
     ], names=["odd", "even"])

     table = pa.Table.from_batches([batch, second_batch])

 .. testcode::

     print(table)

 .. testoutput::

     pyarrow.Table
     odd: int64
     even: int64
     ----
     odd: [[1,3,5,7,9],[11,13,15,17,19]]
     even: [[2,4,6,8,10],[12,14,16,18,20]]
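
 The nested lists in the output reflect that each batch becomes one chunk
 of the resulting columns. A small sketch of this, using single-column
 batches and :meth:`pyarrow.Table.combine_chunks` to merge the chunks
 (``chunked_table`` and ``combined`` are only illustrative names):

 .. code-block:: python

     import pyarrow as pa

     batches = [
         pa.RecordBatch.from_arrays([pa.array([1, 3])], names=["odd"]),
         pa.RecordBatch.from_arrays([pa.array([5, 7])], names=["odd"]),
     ]

     # One chunk per input batch
     chunked_table = pa.Table.from_batches(batches)
     print(chunked_table.column("odd").num_chunks)  # 2

     # Concatenate the chunks of every column into a single chunk
     combined = chunked_table.combine_chunks()
     print(combined.column("odd").num_chunks)       # 1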

 Conversely, a :class:`pyarrow.Table` can be converted to a list of
 :class:`pyarrow.RecordBatch` using the :meth:`pyarrow.Table.to_batches`
 method:

 .. testcode::

     record_batches = table.to_batches(max_chunksize=5)
     print(len(record_batches))

 .. testoutput::

     2
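
 ``max_chunksize`` is an upper bound on the number of rows per batch, so a
 chunk larger than that is split into several batches. A minimal sketch
 with a single-chunk table (``small_table`` and ``pieces`` are only
 illustrative names):

 .. code-block:: python

     import pyarrow as pa

     small_table = pa.table({"x": [1, 2, 3, 4, 5]})

     # Five rows with at most two rows per batch
     pieces = small_table.to_batches(max_chunksize=2)
     print([piece.num_rows for piece in pieces])  # [2, 2, 1]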

 Store Categorical Data
 ======================

 Arrow provides the :class:`pyarrow.DictionaryArray` type
 to represent categorical data without the cost of
 storing and repeating the categories over and over. Each distinct value
 is stored once and referenced by an integer index, which can reduce memory
 use when a column holds large values (such as text).

 If you have an array containing repeated categorical data,
 you can convert it to a :class:`pyarrow.DictionaryArray`
 using :meth:`pyarrow.Array.dictionary_encode`:

 .. testcode::

     arr = pa.array(["red", "green", "blue", "blue", "green", "red"])

     categorical = arr.dictionary_encode()
     print(categorical)

 .. testoutput::

     ...
     -- dictionary:
       [
         "red",
         "green",
         "blue"
       ]
     -- indices:
       [
         0,
         1,
         2,
         2,
         1,
         0
       ]

 If you already know the categories and indices, you can skip the encode
 step and create the ``DictionaryArray`` directly using
 :meth:`pyarrow.DictionaryArray.from_arrays`:

 .. testcode::

     categorical = pa.DictionaryArray.from_arrays(
         indices=[0, 1, 2, 2, 1, 0],
         dictionary=["red", "green", "blue"]
     )
     print(categorical)

 .. testoutput::

     ...
     -- dictionary:
       [
         "red",
         "green",
         "blue"
       ]
     -- indices:
       [
         0,
         1,
         2,
         2,
         1,
         0
       ]
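
 The indices and the dictionary can also be passed as Arrow arrays, which
 gives control over the index type and allows missing values. A sketch
 under those assumptions, where a null index stands for a missing category
 (``with_missing`` is only an illustrative name):

 .. code-block:: python

     import pyarrow as pa

     # 8-bit indices are enough for a small number of categories
     indices = pa.array([0, None, 1, 0], type=pa.int8())
     dictionary = pa.array(["red", "green"])

     with_missing = pa.DictionaryArray.from_arrays(indices, dictionary)

     print(with_missing.to_pylist())       # ['red', None, 'green', 'red']
     print(with_missing.type.index_type)   # int8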