python/README.md - arrow-nanoarrow - Git at Google

 <!---
   Licensed to the Apache Software Foundation (ASF) under one
   or more contributor license agreements.  See the NOTICE file
   distributed with this work for additional information
   regarding copyright ownership.  The ASF licenses this file
   to you under the Apache License, Version 2.0 (the
   "License"); you may not use this file except in compliance
   with the License.  You may obtain a copy of the License at

     http://www.apache.org/licenses/LICENSE-2.0

   Unless required by applicable law or agreed to in writing,
   software distributed under the License is distributed on an
   "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
   KIND, either express or implied.  See the License for the
   specific language governing permissions and limitations
   under the License.
 -->

 <!-- Render with jupyter nbconvert --to markdown README.ipynb -->

 # nanoarrow for Python

 The nanoarrow Python package provides bindings to the nanoarrow C library. Like
 the nanoarrow C library, it provides tools to facilitate the use of the
 [Arrow C Data](https://arrow.apache.org/docs/format/CDataInterface.html)
 and [Arrow C Stream](https://arrow.apache.org/docs/format/CStreamInterface.html)
 interfaces.

 ## Installation

 The nanoarrow Python bindings are available from [PyPI](https://pypi.org/) and
 [conda-forge](https://conda-forge.org/):

 ```shell
 pip install nanoarrow
 conda install nanoarrow -c conda-forge
 ```

 Development versions (based on the `main` branch) are also available:

 ```shell
 pip install --extra-index-url https://pypi.fury.io/arrow-nightlies/ \
     --prefer-binary --pre nanoarrow
 ```

 If you can import the namespace, you're good to go!


 ```python
 import nanoarrow as na
 ```

 ## Data types, arrays, and array streams

 The Arrow C Data and Arrow C Stream interfaces are comprised of three structures: the `ArrowSchema` which represents a data type of an array, the `ArrowArray` which represents the values of an array, and an `ArrowArrayStream`, which represents zero or more `ArrowArray`s with a common `ArrowSchema`. These concepts map to the `nanoarrow.Schema`, `nanoarrow.Array`, and `nanoarrow.ArrayStream` in the Python package.


 ```python
 na.int32()
 ```


     <Schema> int32


 ```python
 na.Array([1, 2, 3], na.int32())
 ```


     nanoarrow.Array<int32>[3]
     1
     2
     3


 The `nanoarrow.Array` can accommodate arrays with any number of chunks, reflecting the reality that many array containers (e.g., `pyarrow.ChunkedArray`, `polars.Series`) support this.


 ```python
 chunked = na.Array.from_chunks([[1, 2, 3], [4, 5, 6]], na.int32())
 chunked
 ```


     nanoarrow.Array<int32>[6]
     1
     2
     3
     4
     5
     6


 Whereas chunks of an `Array` are always fully materialized when the object is constructed, the chunks of an `ArrayStream` have not necessarily been resolved yet.


 ```python
 stream = na.ArrayStream(chunked)
 stream
 ```


     nanoarrow.ArrayStream<int32>


 ```python
 with stream:
     for chunk in stream:
         print(chunk)
 ```

     nanoarrow.Array<int32>[3]
     1
     2
     3
     nanoarrow.Array<int32>[3]
     4
     5
     6


 The `nanoarrow.ArrayStream` also provides an interface to nanoarrow's [Arrow IPC](https://arrow.apache.org/docs/format/Columnar.html#serialization-and-interprocess-communication-ipc) reader:


 ```python
 url = "https://github.com/apache/arrow-experiments/raw/main/data/arrow-commits/arrow-commits.arrows"
 na.ArrayStream.from_url(url)
 ```


     nanoarrow.ArrayStream<non-nullable struct<commit: string, time: timestamp('us', 'UTC'), files: int3...>


 These objects implement the [Arrow PyCapsule interface](https://arrow.apache.org/docs/format/CDataInterface/PyCapsuleInterface.html) for both producing and consuming and are interchangeable with `pyarrow` objects in many cases:


 ```python
 import pyarrow as pa

 pa.field(na.int32())
 ```


     pyarrow.Field<: int32>


 ```python
 pa.chunked_array(chunked)
 ```


     <pyarrow.lib.ChunkedArray object at 0x12a49a250>
     [
       [
         1,
         2,
         3
       ],
       [
         4,
         5,
         6
       ]
     ]


 ```python
 pa.array(chunked.chunk(1))
 ```


     <pyarrow.lib.Int32Array object at 0x11b552500>
     [
       4,
       5,
       6
     ]


 ```python
 na.Array(pa.array([10, 11, 12]))
 ```


     nanoarrow.Array<int64>[3]
     10
     11
     12


 ```python
 na.Schema(pa.string())
 ```


     <Schema> string


 ## Low-level C library bindings

 The nanoarrow Python package also provides lower level wrappers around Arrow C interface structures. You can create these using `nanoarrow.c_schema()`, `nanoarrow.c_array()`, and `nanoarrow.c_array_stream()`.

 ### Schemas

 Use `nanoarrow.c_schema()` to convert an object to an `ArrowSchema` and wrap it as a Python object. This works for any object implementing the [Arrow PyCapsule Interface](https://arrow.apache.org/docs/format/CDataInterface.html) (e.g., `pyarrow.Schema`, `pyarrow.DataType`, and `pyarrow.Field`).


 ```python
 na.c_schema(pa.decimal128(10, 3))
 ```


     <nanoarrow.c_schema.CSchema decimal128(10, 3)>
     - format: 'd:10,3'
     - name: ''
     - flags: 2
     - metadata: NULL
     - dictionary: NULL
     - children[0]:


 Using `c_schema()` is a good fit for testing and for ephemeral schema objects that are being passed from one library to another. To extract the fields of a schema in a more convenient form, use `Schema()`:


 ```python
 schema = na.Schema(pa.decimal128(10, 3))
 schema.precision, schema.scale
 ```


     (10, 3)


 The `CSchema` object cleans up after itself: when the object is deleted, the underlying `ArrowSchema` is released.

 ### Arrays

 You can use `nanoarrow.c_array()` to convert an array-like object to an `ArrowArray`, wrap it as a Python object, and attach a schema that can be used to interpret its contents. This works for any object implementing the [Arrow PyCapsule Interface](https://arrow.apache.org/docs/format/CDataInterface.html) (e.g., `pyarrow.Array`, `pyarrow.RecordBatch`).


 ```python
 na.c_array(["one", "two", "three", None], na.string())
 ```


     <nanoarrow.c_array.CArray string>
     - length: 4
     - offset: 0
     - null_count: 1
     - buffers: (4754305168, 4754307808, 4754310464)
     - dictionary: NULL
     - children[0]:


 Using `c_array()` is a good fit for testing and for ephemeral array objects that are being passed from one library to another. For a higher level interface, use `Array()`:


 ```python
 array = na.Array(["one", "two", "three", None], na.string())
 array.to_pylist()
 ```


     ['one', 'two', 'three', None]


 ```python
 array.buffers
 ```


     (nanoarrow.c_lib.CBufferView(bool[1 b] 11100000),
      nanoarrow.c_lib.CBufferView(int32[20 b] 0 3 6 11 11),
      nanoarrow.c_lib.CBufferView(string[11 b] b'onetwothree'))


 Advanced users can create arrays directly from buffers using `c_array_from_buffers()`:


 ```python
 na.c_array_from_buffers(
     na.string(),
     2,
     [None, na.c_buffer([0, 3, 6], na.int32()), b"abcdef"]
 )
 ```


     <nanoarrow.c_array.CArray string>
     - length: 2
     - offset: 0
     - null_count: 0
     - buffers: (0, 5002908320, 4999694624)
     - dictionary: NULL
     - children[0]:


 ### Array streams

 You can use `nanoarrow.c_array_stream()` to wrap an object representing a sequence of `CArray`s with a common `CSchema` to an `ArrowArrayStream` and wrap it as a Python object. This works for any object implementing the [Arrow PyCapsule Interface](https://arrow.apache.org/docs/format/CDataInterface.html) (e.g., `pyarrow.RecordBatchReader`, `pyarrow.ChunkedArray`).


 ```python
 pa_batch = pa.record_batch({"col1": [1, 2, 3]})
 reader = pa.RecordBatchReader.from_batches(pa_batch.schema, [pa_batch])
 array_stream = na.c_array_stream(reader)
 array_stream
 ```


     <nanoarrow.c_array_stream.CArrayStream>
     - get_schema(): struct<col1: int64>


 You can pull the next array from the stream using `.get_next()` or use it like an iterator. The `.get_next()` method will raise `StopIteration` when there are no more arrays in the stream.


 ```python
 for array in array_stream:
     print(array)
 ```

     <nanoarrow.c_array.CArray struct<col1: int64>>
     - length: 3
     - offset: 0
     - null_count: 0
     - buffers: (0,)
     - dictionary: NULL
     - children[1]:
       'col1': <nanoarrow.c_array.CArray int64>
         - length: 3
         - offset: 0
         - null_count: 0
         - buffers: (0, 2642948588352)
         - dictionary: NULL
         - children[0]:


 Use `ArrayStream()` for a higher level interface:


 ```python
 reader = pa.RecordBatchReader.from_batches(pa_batch.schema, [pa_batch])
 na.ArrayStream(reader).read_all()
 ```


     nanoarrow.Array<non-nullable struct<col1: int64>>[3]
     {'col1': 1}
     {'col1': 2}
     {'col1': 3}


 ## Development

 Python bindings for nanoarrow are managed with [setuptools](https://setuptools.pypa.io/en/latest/index.html).
 This means you can build the project using:

 ```shell
 git clone https://github.com/apache/arrow-nanoarrow.git
 cd arrow-nanoarrow/python
 # Build dependencies:
 # pip install meson meson-python cython
 pip install -e . --no-build-isolation
 ```

 Tests use [pytest](https://docs.pytest.org/):

 ```shell
 # Install dependencies
 pip install -e ".[test]"

 # Run tests
 pytest -vvx
 ```

 CMake is currently required to ensure that the vendored copy of nanoarrow in the Python package stays in sync with the nanoarrow sources in the working tree.
	<!---
	Licensed to the Apache Software Foundation (ASF) under one
	or more contributor license agreements. See the NOTICE file
	distributed with this work for additional information
	regarding copyright ownership. The ASF licenses this file
	to you under the Apache License, Version 2.0 (the
	"License"); you may not use this file except in compliance
	with the License. You may obtain a copy of the License at

	http://www.apache.org/licenses/LICENSE-2.0

	Unless required by applicable law or agreed to in writing,
	software distributed under the License is distributed on an
	"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
	KIND, either express or implied. See the License for the
	specific language governing permissions and limitations
	under the License.
	-->

	<!-- Render with jupyter nbconvert --to markdown README.ipynb -->

	# nanoarrow for Python

	The nanoarrow Python package provides bindings to the nanoarrow C library. Like
	the nanoarrow C library, it provides tools to facilitate the use of the
	[Arrow C Data](https://arrow.apache.org/docs/format/CDataInterface.html)
	and [Arrow C Stream](https://arrow.apache.org/docs/format/CStreamInterface.html)
	interfaces.

	## Installation

	The nanoarrow Python bindings are available from [PyPI](https://pypi.org/) and
	[conda-forge](https://conda-forge.org/):

	```shell
	pip install nanoarrow
	conda install nanoarrow -c conda-forge
	```

	Development versions (based on the `main` branch) are also available:

	```shell
	pip install --extra-index-url https://pypi.fury.io/arrow-nightlies/ \
	--prefer-binary --pre nanoarrow
	```

	If you can import the namespace, you're good to go!


	```python
	import nanoarrow as na
	```

	## Data types, arrays, and array streams

	The Arrow C Data and Arrow C Stream interfaces are comprised of three structures: the `ArrowSchema` which represents a data type of an array, the `ArrowArray` which represents the values of an array, and an `ArrowArrayStream`, which represents zero or more `ArrowArray`s with a common `ArrowSchema`. These concepts map to the `nanoarrow.Schema`, `nanoarrow.Array`, and `nanoarrow.ArrayStream` in the Python package.


	```python
	na.int32()
	```




	<Schema> int32




	```python
	na.Array([1, 2, 3], na.int32())
	```




	nanoarrow.Array<int32>[3]
	1
	2
	3



	The `nanoarrow.Array` can accommodate arrays with any number of chunks, reflecting the reality that many array containers (e.g., `pyarrow.ChunkedArray`, `polars.Series`) support this.


	```python
	chunked = na.Array.from_chunks([[1, 2, 3], [4, 5, 6]], na.int32())
	chunked
	```




	nanoarrow.Array<int32>[6]
	1
	2
	3
	4
	5
	6



	Whereas chunks of an `Array` are always fully materialized when the object is constructed, the chunks of an `ArrayStream` have not necessarily been resolved yet.


	```python
	stream = na.ArrayStream(chunked)
	stream
	```




	nanoarrow.ArrayStream<int32>




	```python
	with stream:
	for chunk in stream:
	print(chunk)
	```

	nanoarrow.Array<int32>[3]
	1
	2
	3
	nanoarrow.Array<int32>[3]
	4
	5
	6


	The `nanoarrow.ArrayStream` also provides an interface to nanoarrow's [Arrow IPC](https://arrow.apache.org/docs/format/Columnar.html#serialization-and-interprocess-communication-ipc) reader:


	```python
	url = "https://github.com/apache/arrow-experiments/raw/main/data/arrow-commits/arrow-commits.arrows"
	na.ArrayStream.from_url(url)
	```




	nanoarrow.ArrayStream<non-nullable struct<commit: string, time: timestamp('us', 'UTC'), files: int3...>



	These objects implement the [Arrow PyCapsule interface](https://arrow.apache.org/docs/format/CDataInterface/PyCapsuleInterface.html) for both producing and consuming and are interchangeable with `pyarrow` objects in many cases:


	```python
	import pyarrow as pa

	pa.field(na.int32())
	```




	pyarrow.Field<: int32>




	```python
	pa.chunked_array(chunked)
	```




	<pyarrow.lib.ChunkedArray object at 0x12a49a250>
	[
	[
	1,
	2,
	3
	],
	[
	4,
	5,
	6
	]
	]




	```python
	pa.array(chunked.chunk(1))
	```




	<pyarrow.lib.Int32Array object at 0x11b552500>
	[
	4,
	5,
	6
	]




	```python
	na.Array(pa.array([10, 11, 12]))
	```




	nanoarrow.Array<int64>[3]
	10
	11
	12




	```python
	na.Schema(pa.string())
	```




	<Schema> string



	## Low-level C library bindings

	The nanoarrow Python package also provides lower level wrappers around Arrow C interface structures. You can create these using `nanoarrow.c_schema()`, `nanoarrow.c_array()`, and `nanoarrow.c_array_stream()`.

	### Schemas

	Use `nanoarrow.c_schema()` to convert an object to an `ArrowSchema` and wrap it as a Python object. This works for any object implementing the [Arrow PyCapsule Interface](https://arrow.apache.org/docs/format/CDataInterface.html) (e.g., `pyarrow.Schema`, `pyarrow.DataType`, and `pyarrow.Field`).


	```python
	na.c_schema(pa.decimal128(10, 3))
	```




	<nanoarrow.c_schema.CSchema decimal128(10, 3)>
	- format: 'd:10,3'
	- name: ''
	- flags: 2
	- metadata: NULL
	- dictionary: NULL
	- children[0]:



	Using `c_schema()` is a good fit for testing and for ephemeral schema objects that are being passed from one library to another. To extract the fields of a schema in a more convenient form, use `Schema()`:


	```python
	schema = na.Schema(pa.decimal128(10, 3))
	schema.precision, schema.scale
	```




	(10, 3)



	The `CSchema` object cleans up after itself: when the object is deleted, the underlying `ArrowSchema` is released.

	### Arrays

	You can use `nanoarrow.c_array()` to convert an array-like object to an `ArrowArray`, wrap it as a Python object, and attach a schema that can be used to interpret its contents. This works for any object implementing the [Arrow PyCapsule Interface](https://arrow.apache.org/docs/format/CDataInterface.html) (e.g., `pyarrow.Array`, `pyarrow.RecordBatch`).


	```python
	na.c_array(["one", "two", "three", None], na.string())
	```




	<nanoarrow.c_array.CArray string>
	- length: 4
	- offset: 0
	- null_count: 1
	- buffers: (4754305168, 4754307808, 4754310464)
	- dictionary: NULL
	- children[0]:



	Using `c_array()` is a good fit for testing and for ephemeral array objects that are being passed from one library to another. For a higher level interface, use `Array()`:


	```python
	array = na.Array(["one", "two", "three", None], na.string())
	array.to_pylist()
	```




	['one', 'two', 'three', None]




	```python
	array.buffers
	```




	(nanoarrow.c_lib.CBufferView(bool[1 b] 11100000),
	nanoarrow.c_lib.CBufferView(int32[20 b] 0 3 6 11 11),
	nanoarrow.c_lib.CBufferView(string[11 b] b'onetwothree'))



	Advanced users can create arrays directly from buffers using `c_array_from_buffers()`:


	```python
	na.c_array_from_buffers(
	na.string(),
	2,
	[None, na.c_buffer([0, 3, 6], na.int32()), b"abcdef"]
	)
	```




	<nanoarrow.c_array.CArray string>
	- length: 2
	- offset: 0
	- null_count: 0
	- buffers: (0, 5002908320, 4999694624)
	- dictionary: NULL
	- children[0]:



	### Array streams

	You can use `nanoarrow.c_array_stream()` to wrap an object representing a sequence of `CArray`s with a common `CSchema` to an `ArrowArrayStream` and wrap it as a Python object. This works for any object implementing the [Arrow PyCapsule Interface](https://arrow.apache.org/docs/format/CDataInterface.html) (e.g., `pyarrow.RecordBatchReader`, `pyarrow.ChunkedArray`).


	```python
	pa_batch = pa.record_batch({"col1": [1, 2, 3]})
	reader = pa.RecordBatchReader.from_batches(pa_batch.schema, [pa_batch])
	array_stream = na.c_array_stream(reader)
	array_stream
	```




	<nanoarrow.c_array_stream.CArrayStream>
	- get_schema(): struct<col1: int64>



	You can pull the next array from the stream using `.get_next()` or use it like an iterator. The `.get_next()` method will raise `StopIteration` when there are no more arrays in the stream.


	```python
	for array in array_stream:
	print(array)
	```

	<nanoarrow.c_array.CArray struct<col1: int64>>
	- length: 3
	- offset: 0
	- null_count: 0
	- buffers: (0,)
	- dictionary: NULL
	- children[1]:
	'col1': <nanoarrow.c_array.CArray int64>
	- length: 3
	- offset: 0
	- null_count: 0
	- buffers: (0, 2642948588352)
	- dictionary: NULL
	- children[0]:


	Use `ArrayStream()` for a higher level interface:


	```python
	reader = pa.RecordBatchReader.from_batches(pa_batch.schema, [pa_batch])
	na.ArrayStream(reader).read_all()
	```




	nanoarrow.Array<non-nullable struct<col1: int64>>[3]
	{'col1': 1}
	{'col1': 2}
	{'col1': 3}



	## Development

	Python bindings for nanoarrow are managed with [setuptools](https://setuptools.pypa.io/en/latest/index.html).
	This means you can build the project using:

	```shell
	git clone https://github.com/apache/arrow-nanoarrow.git
	cd arrow-nanoarrow/python
	# Build dependencies:
	# pip install meson meson-python cython
	pip install -e . --no-build-isolation
	```

	Tests use [pytest](https://docs.pytest.org/):

	```shell
	# Install dependencies
	pip install -e ".[test]"

	# Run tests
	pytest -vvx
	```

	CMake is currently required to ensure that the vendored copy of nanoarrow in the Python package stays in sync with the nanoarrow sources in the working tree.