.. Licensed to the Apache Software Foundation (ASF) under one
.. or more contributor license agreements. See the NOTICE file
.. distributed with this work for additional information
.. regarding copyright ownership. The ASF licenses this file
.. to you under the Apache License, Version 2.0 (the
.. "License"); you may not use this file except in compliance
.. with the License. You may obtain a copy of the License at
.. http://www.apache.org/licenses/LICENSE-2.0
.. Unless required by applicable law or agreed to in writing,
.. software distributed under the License is distributed on an
.. "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
.. KIND, either express or implied. See the License for the
.. specific language governing permissions and limitations
.. under the License.

======================
Creating Arrow Objects
======================

Recipes related to the creation of Arrays, Tables,
Tensors and all other Arrow entities.

.. contents::

Creating Arrays
===============

Arrow keeps data in contiguous arrays optimised for memory footprint
and SIMD operations. In Python it's possible to build a :class:`pyarrow.Array`
starting from Python ``lists`` (or sequence types in general),
``numpy`` arrays and ``pandas`` Series.

.. testcode::

    import pyarrow as pa

    array = pa.array([1, 2, 3, 4, 5])

.. testcode::

    print(array)

.. testoutput::

    [
      1,
      2,
      3,
      4,
      5
    ]

Arrays can also provide a ``mask`` to specify which values should
be considered nulls:

.. testcode::

    import numpy as np

    array = pa.array([1, 2, 3, 4, 5],
                     mask=np.array([True, False, True, False, True]))

    print(array)

.. testoutput::

    [
      null,
      2,
      null,
      4,
      null
    ]

When building arrays from ``numpy`` or ``pandas``, Arrow will leverage
optimized code paths that rely on the internal in-memory representation
of the data by ``numpy`` and ``pandas``:

.. testcode::

    import numpy as np
    import pandas as pd

    array_from_numpy = pa.array(np.arange(5))
    array_from_pandas = pa.array(pd.Series([1, 2, 3, 4, 5]))

Creating Tables
===============

Arrow supports tabular data in :class:`pyarrow.Table`: each column
is represented by a :class:`pyarrow.ChunkedArray` and tables can be created
by pairing multiple arrays with names for their columns:

.. testcode::

    import pyarrow as pa

    table = pa.table([
        pa.array([1, 2, 3, 4, 5]),
        pa.array(["a", "b", "c", "d", "e"]),
        pa.array([1.0, 2.0, 3.0, 4.0, 5.0])
    ], names=["col1", "col2", "col3"])

    print(table)

.. testoutput::

    pyarrow.Table
    col1: int64
    col2: string
    col3: double
    ----
    col1: [[1,2,3,4,5]]
    col2: [["a","b","c","d","e"]]
    col3: [[1,2,3,4,5]]

Create Table from Plain Types
=============================

Arrow allows fast zero-copy creation of Arrow arrays
from ``numpy`` and ``pandas`` arrays and series, but it's also
possible to create Arrow Arrays and Tables from
plain Python structures.

The :func:`pyarrow.table` function allows creation of Tables
from a variety of inputs, including plain Python objects:

.. testcode::

    import pyarrow as pa

    table = pa.table({
        "col1": [1, 2, 3, 4, 5],
        "col2": ["a", "b", "c", "d", "e"]
    })

    print(table)

.. testoutput::

    pyarrow.Table
    col1: int64
    col2: string
    ----
    col1: [[1,2,3,4,5]]
    col2: [["a","b","c","d","e"]]

.. note::

    All values provided in the dictionary will be passed to
    :func:`pyarrow.array` for conversion to Arrow arrays,
    and will benefit from zero-copy behaviour when possible.

The :meth:`pyarrow.Table.from_pylist` method allows the creation
of Tables from Python lists of row dicts. Types are inferred if a
schema is not explicitly passed:

.. testcode::

    import pyarrow as pa

    table = pa.Table.from_pylist([
        {"col1": 1, "col2": "a"},
        {"col1": 2, "col2": "b"},
        {"col1": 3, "col2": "c"},
        {"col1": 4, "col2": "d"},
        {"col1": 5, "col2": "e"}
    ])

    print(table)

.. testoutput::

    pyarrow.Table
    col1: int64
    col2: string
    ----
    col1: [[1,2,3,4,5]]
    col2: [["a","b","c","d","e"]]

Creating Record Batches
=======================

Most I/O operations in Arrow happen when shipping batches of data
to their destination. :class:`pyarrow.RecordBatch` is the way
Arrow represents batches of data. A RecordBatch can be seen as a slice
of a table.

.. testcode::

    import pyarrow as pa

    batch = pa.RecordBatch.from_arrays([
        pa.array([1, 3, 5, 7, 9]),
        pa.array([2, 4, 6, 8, 10])
    ], names=["odd", "even"])

Multiple batches can be combined into a table using
:meth:`pyarrow.Table.from_batches`:

.. testcode::

    second_batch = pa.RecordBatch.from_arrays([
        pa.array([11, 13, 15, 17, 19]),
        pa.array([12, 14, 16, 18, 20])
    ], names=["odd", "even"])

    table = pa.Table.from_batches([batch, second_batch])

.. testcode::

    print(table)

.. testoutput::

    pyarrow.Table
    odd: int64
    even: int64
    ----
    odd: [[1,3,5,7,9],[11,13,15,17,19]]
    even: [[2,4,6,8,10],[12,14,16,18,20]]

Equally, :class:`pyarrow.Table` can be converted to a list of
:class:`pyarrow.RecordBatch` using the :meth:`pyarrow.Table.to_batches`
method:

.. testcode::

    record_batches = table.to_batches(max_chunksize=5)
    print(len(record_batches))

.. testoutput::

    2

Store Categorical Data
======================

Arrow provides the :class:`pyarrow.DictionaryArray` type
to represent categorical data without the cost of
storing and repeating the categories over and over. This can reduce memory use
when columns contain large, repeated values (such as text).

If you have an array containing repeated categorical data,
it is possible to convert it to a :class:`pyarrow.DictionaryArray`
using :meth:`pyarrow.Array.dictionary_encode`:

.. testcode::

    arr = pa.array(["red", "green", "blue", "blue", "green", "red"])

    categorical = arr.dictionary_encode()
    print(categorical)

.. testoutput::

    ...
    -- dictionary:
      [
        "red",
        "green",
        "blue"
      ]
    -- indices:
      [
        0,
        1,
        2,
        2,
        1,
        0
      ]

If you already know the categories and indices then you can skip the encode
step and directly create the ``DictionaryArray`` using
:meth:`pyarrow.DictionaryArray.from_arrays`:

.. testcode::

    categorical = pa.DictionaryArray.from_arrays(
        indices=[0, 1, 2, 2, 1, 0],
        dictionary=["red", "green", "blue"]
    )

    print(categorical)

.. testoutput::

    ...
    -- dictionary:
      [
        "red",
        "green",
        "blue"
      ]
    -- indices:
      [
        0,
        1,
        2,
        2,
        1,
        0
      ]