nanoarrow for Python

The nanoarrow Python package provides bindings to the nanoarrow C library. Like the nanoarrow C library, it provides tools to facilitate the use of the Arrow C Data and Arrow C Stream interfaces.

Installation

Python bindings for nanoarrow are not yet available on PyPI. You can install via URL (requires a C compiler):

python -m pip install "git+https://github.com/apache/arrow-nanoarrow.git#egg=nanoarrow&subdirectory=python"

If you can import the namespace, you're good to go!

import nanoarrow as na
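If you would rather check for the package without importing it, something like the following may work. This is a minimal sketch using only the standard library; the helper name `is_installed` is made up for this example and is not part of nanoarrow.

```python
import importlib.util


def is_installed(module_name):
    """Return True if a top-level module can be found (hypothetical helper)."""
    # find_spec() returns None when no top-level module with this name exists.
    return importlib.util.find_spec(module_name) is not None


print(is_installed("nanoarrow"))
```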

Low-level C library bindings

The Arrow C Data and Arrow C Stream interfaces comprise three structures: the ArrowSchema, which represents the data type of an array; the ArrowArray, which represents the values of an array; and the ArrowArrayStream, which represents zero or more ArrowArrays with a common ArrowSchema.

Schemas

Use nanoarrow.c_schema() to convert an object to an ArrowSchema and wrap it as a Python object. This works for any object implementing the Arrow PyCapsule Interface (e.g., pyarrow.Schema, pyarrow.DataType, and pyarrow.Field).

import pyarrow as pa
schema = na.c_schema(pa.decimal128(10, 3))
schema

<nanoarrow.c_lib.CSchema decimal128(10, 3)>
- format: 'd:10,3'
- name: ''
- flags: 2
- metadata: NULL
- dictionary: NULL
- children[0]:
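The flags field shown above is a bitfield. As a rough illustration, the sketch below decodes it using the flag values defined by the Arrow C Data interface specification. The helper name `decode_flags` and the lowercase labels are assumptions made for this example; they are not part of nanoarrow.

```python
# Flag bits as defined by the Arrow C Data interface specification.
ARROW_FLAG_DICTIONARY_ORDERED = 1
ARROW_FLAG_NULLABLE = 2
ARROW_FLAG_MAP_KEYS_SORTED = 4

# Labels are illustrative only.
_FLAG_NAMES = {
    ARROW_FLAG_DICTIONARY_ORDERED: "dictionary_ordered",
    ARROW_FLAG_NULLABLE: "nullable",
    ARROW_FLAG_MAP_KEYS_SORTED: "map_keys_sorted",
}


def decode_flags(flags):
    """Return the labels of all flag bits set in `flags` (hypothetical helper)."""
    return [name for bit, name in _FLAG_NAMES.items() if flags & bit]


print(decode_flags(2))  # ['nullable']
print(decode_flags(0))  # []
```

Under this reading, flags: 2 means the schema is nullable and nothing else.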

You can extract the fields of a CSchema object one at a time or parse it into a view to extract deserialized parameters.

na.c_schema_view(schema)

<nanoarrow.c_lib.CSchemaView>
- type: 'decimal128'
- storage_type: 'decimal128'
- decimal_bitwidth: 128
- decimal_precision: 10
- decimal_scale: 3
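To give a sense of what "deserialized parameters" means here, the following pure-Python sketch parses a decimal format string of the form used by the Arrow C Data interface into the same three fields shown above. It is not how nanoarrow implements the view; `parse_decimal_format` is a hypothetical helper written only for illustration.

```python
def parse_decimal_format(fmt):
    """Parse 'd:precision,scale[,bitwidth]' into a dict (hypothetical helper)."""
    if not fmt.startswith("d:"):
        raise ValueError(f"not a decimal format string: {fmt!r}")

    parts = [int(part) for part in fmt[2:].split(",")]
    if len(parts) == 2:
        precision, scale = parts
        bitwidth = 128  # the bit width is 128 when it is omitted
    elif len(parts) == 3:
        precision, scale, bitwidth = parts
    else:
        raise ValueError(f"unexpected decimal format string: {fmt!r}")

    return {
        "decimal_bitwidth": bitwidth,
        "decimal_precision": precision,
        "decimal_scale": scale,
    }


print(parse_decimal_format("d:10,3"))
# {'decimal_bitwidth': 128, 'decimal_precision': 10, 'decimal_scale': 3}
```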

Advanced users can allocate an empty CSchema and populate its contents by passing its ._addr() to a schema-exporting function.

schema = na.allocate_c_schema()
pa.int32()._export_to_c(schema._addr())
schema

<nanoarrow.c_lib.CSchema int32>
- format: 'i'
- name: ''
- flags: 2
- metadata: NULL
- dictionary: NULL
- children[0]:

The CSchema object cleans up after itself: when the object is deleted, the underlying ArrowSchema is released.

Arrays

You can use nanoarrow.c_array() to convert an array-like object to an ArrowArray, wrap it as a Python object, and attach a schema that can be used to interpret its contents. This works for any object implementing the Arrow PyCapsule Interface (e.g., pyarrow.Array, pyarrow.RecordBatch).

array = na.c_array(pa.array(["one", "two", "three", None]))
array

<nanoarrow.c_lib.CArray string>
- length: 4
- offset: 0
- null_count: 1
- buffers: (2939032895680, 2939032895616, 2939032895744)
- dictionary: NULL
- children[0]:

You can extract the fields of a CArray one at a time or parse it into a view to extract deserialized content:

na.c_array_view(array)

<nanoarrow.c_lib.CArrayView>
- storage_type: 'string'
- length: 4
- offset: 0
- null_count: 1
- buffers[3]:
  - <bool validity[1 b] 11100000>
  - <int32 data_offset[20 b] 0 3 6 11 11>
  - <string data[11 b] b'onetwothree'>
- dictionary: NULL
- children[0]:
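The three buffers above are enough to reconstruct the original values. The sketch below does this by hand in pure Python, assuming little-endian int32 offsets and assuming the validity bits are printed in element order (least-significant bit first). The buffers are rebuilt from the printed values rather than read from nanoarrow, and `decode_strings` is a hypothetical helper.

```python
import struct

# Buffers rebuilt from the output above (assumed little-endian).
validity = bytes([0b00000111])                 # elements 0, 1 and 2 are valid
offsets = struct.pack("<5i", 0, 3, 6, 11, 11)  # 5 int32 values, 20 bytes
data = b"onetwothree"


def decode_strings(validity, offsets, data, length):
    """Rebuild a list of str/None from Arrow string buffers (hypothetical helper)."""
    values = []
    for i in range(length):
        # Bit i of the validity bitmap, least-significant bit first.
        is_valid = (validity[i // 8] >> (i % 8)) & 1
        if not is_valid:
            values.append(None)
            continue
        # Element i spans data[offsets[i]:offsets[i + 1]].
        start, end = struct.unpack_from("<2i", offsets, i * 4)
        values.append(data[start:end].decode("utf-8"))
    return values


print(decode_strings(validity, offsets, data, 4))
# ['one', 'two', 'three', None]
```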

As with the CSchema, you can allocate an empty CArray and pass its address, obtained with ._addr(), to an array-exporting function.

array = na.allocate_c_array()
pa.array([1, 2, 3])._export_to_c(array._addr(), array.schema._addr())
array.length

Array streams

You can use nanoarrow.c_array_stream() to convert an object representing a sequence of CArrays with a common CSchema to an ArrowArrayStream and wrap it as a Python object. This works for any object implementing the Arrow PyCapsule Interface (e.g., pyarrow.RecordBatchReader).

pa_array_child = pa.array([1, 2, 3], pa.int32())
pa_array = pa.record_batch([pa_array_child], names=["some_column"])
reader = pa.RecordBatchReader.from_batches(pa_array.schema, [pa_array])
array_stream = na.c_array_stream(reader)
array_stream

<nanoarrow.c_lib.CArrayStream>
- get_schema(): <nanoarrow.c_lib.CSchema struct>
  - format: '+s'
  - name: ''
  - flags: 0
  - metadata: NULL
  - dictionary: NULL
  - children[1]:
    'some_column': <nanoarrow.c_lib.CSchema int32>
      - format: 'i'
      - name: 'some_column'
      - flags: 2
      - metadata: NULL
      - dictionary: NULL
      - children[0]:

You can pull the next array from the stream using .get_next() or use it like an iterator. The .get_next() method will raise StopIteration when there are no more arrays in the stream.

for array in array_stream:
    print(array)

<nanoarrow.c_lib.CArray struct>
- length: 3
- offset: 0
- null_count: 0
- buffers: (0,)
- dictionary: NULL
- children[1]:
  'some_column': <nanoarrow.c_lib.CArray int32>
    - length: 3
    - offset: 0
    - null_count: 0
    - buffers: (0, 2939033026688)
    - dictionary: NULL
    - children[0]:
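If you call .get_next() directly instead of iterating, you need to handle StopIteration yourself. The sketch below shows one way to do that. To stay self-contained it uses a hypothetical stand-in class, `ToyStream`, rather than a real CArrayStream; `drain` is likewise a made-up helper and relies only on the .get_next() behaviour described above.

```python
class ToyStream:
    """Hypothetical stand-in for a stream; not part of nanoarrow."""

    def __init__(self, arrays):
        self._arrays = list(arrays)

    def get_next(self):
        # Mirror the documented behaviour: raise StopIteration when exhausted.
        if not self._arrays:
            raise StopIteration
        return self._arrays.pop(0)


def drain(stream):
    """Collect every remaining array from an object with .get_next()."""
    arrays = []
    while True:
        try:
            arrays.append(stream.get_next())
        except StopIteration:
            break
    return arrays


print(drain(ToyStream(["batch 0", "batch 1"])))
# ['batch 0', 'batch 1']
```

The explicit try/except matters if you wrap this logic in a generator: since Python 3.7, a StopIteration that escapes a generator body is converted to RuntimeError.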

You can also get the address of a freshly-allocated stream to pass to a suitable exporting function:

array_stream = na.allocate_c_array_stream()
reader._export_to_c(array_stream._addr())
array_stream

<nanoarrow.c_lib.CArrayStream>
- get_schema(): <nanoarrow.c_lib.CSchema struct>
  - format: '+s'
  - name: ''
  - flags: 0
  - metadata: NULL
  - dictionary: NULL
  - children[1]:
    'some_column': <nanoarrow.c_lib.CSchema int32>
      - format: 'i'
      - name: 'some_column'
      - flags: 2
      - metadata: NULL
      - dictionary: NULL
      - children[0]:

Development

Python bindings for nanoarrow are managed with setuptools. This means you can build the project using:

git clone https://github.com/apache/arrow-nanoarrow.git
cd arrow-nanoarrow/python
pip install -e .

Tests use pytest:

# Install dependencies
pip install -e ".[test]"

# Run tests
pytest -vvx