{
"cells": [
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"<!---\n",
" Licensed to the Apache Software Foundation (ASF) under one\n",
" or more contributor license agreements. See the NOTICE file\n",
" distributed with this work for additional information\n",
" regarding copyright ownership. The ASF licenses this file\n",
" to you under the Apache License, Version 2.0 (the\n",
" \"License\"); you may not use this file except in compliance\n",
" with the License. You may obtain a copy of the License at\n",
"\n",
" http://www.apache.org/licenses/LICENSE-2.0\n",
"\n",
" Unless required by applicable law or agreed to in writing,\n",
" software distributed under the License is distributed on an\n",
" \"AS IS\" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY\n",
" KIND, either express or implied. See the License for the\n",
" specific language governing permissions and limitations\n",
" under the License.\n",
"-->\n",
"\n",
"<!-- Render with jupyter nbconvert --to markdown README.ipynb -->\n",
"\n",
"# nanoarrow for Python\n",
"\n",
"The nanoarrow Python package provides bindings to the nanoarrow C library. Like\n",
"the nanoarrow C library, it provides tools to facilitate the use of the\n",
"[Arrow C Data](https://arrow.apache.org/docs/format/CDataInterface.html) \n",
"and [Arrow C Stream](https://arrow.apache.org/docs/format/CStreamInterface.html) \n",
"interfaces.\n",
"\n",
"## Installation\n",
"\n",
"The nanoarrow Python bindings are available from [PyPI](https://pypi.org/) and\n",
"[conda-forge](https://conda-forge.org/):\n",
"\n",
"```shell\n",
"pip install nanoarrow\n",
"conda install nanoarrow -c conda-forge\n",
"```\n",
"\n",
"Development versions (based on the `main` branch) are also available:\n",
"\n",
"```shell\n",
"pip install --extra-index-url https://pypi.fury.io/arrow-nightlies/ \\\n",
" --prefer-binary --pre nanoarrow\n",
"```\n",
"\n",
"If you can import the namespace, you're good to go!"
]
},
{
"cell_type": "code",
"execution_count": 46,
"metadata": {},
"outputs": [],
"source": [
"import nanoarrow as na"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Data types, arrays, and array streams\n",
"\n",
"The Arrow C Data and Arrow C Stream interfaces are comprised of three structures: the `ArrowSchema` which represents a data type of an array, the `ArrowArray` which represents the values of an array, and an `ArrowArrayStream`, which represents zero or more `ArrowArray`s with a common `ArrowSchema`. These concepts map to the `nanoarrow.Schema`, `nanoarrow.Array`, and `nanoarrow.ArrayStream` in the Python package."
]
},
{
"cell_type": "code",
"execution_count": 47,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"<Schema> int32"
]
},
"execution_count": 47,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"na.int32()"
]
},
{
"cell_type": "code",
"execution_count": 48,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"nanoarrow.Array<int32>[3]\n",
"1\n",
"2\n",
"3"
]
},
"execution_count": 48,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"na.Array([1, 2, 3], na.int32())"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The `nanoarrow.Array` can accommodate arrays with any number of chunks, reflecting the reality that many array containers (e.g., `pyarrow.ChunkedArray`, `polars.Series`) support this."
]
},
{
"cell_type": "code",
"execution_count": 49,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"nanoarrow.Array<int32>[6]\n",
"1\n",
"2\n",
"3\n",
"4\n",
"5\n",
"6"
]
},
"execution_count": 49,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"chunked = na.Array.from_chunks([[1, 2, 3], [4, 5, 6]], na.int32())\n",
"chunked"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Whereas chunks of an `Array` are always fully materialized when the object is constructed, the chunks of an `ArrayStream` have not necessarily been resolved yet."
]
},
{
"cell_type": "code",
"execution_count": 50,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"nanoarrow.ArrayStream<int32>"
]
},
"execution_count": 50,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"stream = na.ArrayStream(chunked)\n",
"stream"
]
},
{
"cell_type": "code",
"execution_count": 51,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"nanoarrow.Array<int32>[3]\n",
"1\n",
"2\n",
"3\n",
"nanoarrow.Array<int32>[3]\n",
"4\n",
"5\n",
"6\n"
]
}
],
"source": [
"with stream:\n",
" for chunk in stream:\n",
" print(chunk)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The `nanoarrow.ArrayStream` also provides an interface to nanoarrow's [Arrow IPC](https://arrow.apache.org/docs/format/Columnar.html#serialization-and-interprocess-communication-ipc) reader:"
]
},
{
"cell_type": "code",
"execution_count": 52,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"nanoarrow.ArrayStream<non-nullable struct<commit: string, time: timestamp('us', 'UTC'), files: int3...>"
]
},
"execution_count": 52,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"url = \"https://github.com/apache/arrow-experiments/raw/main/data/arrow-commits/arrow-commits.arrows\"\n",
"na.ArrayStream.from_url(url)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"These objects implement the [Arrow PyCapsule interface](https://arrow.apache.org/docs/format/CDataInterface/PyCapsuleInterface.html) for both producing and consuming and are interchangeable with `pyarrow` objects in many cases:"
]
},
{
"cell_type": "code",
"execution_count": 53,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"pyarrow.Field<: int32>"
]
},
"execution_count": 53,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"import pyarrow as pa\n",
"\n",
"pa.field(na.int32())"
]
},
{
"cell_type": "code",
"execution_count": 54,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"<pyarrow.lib.ChunkedArray object at 0x12a49a250>\n",
"[\n",
" [\n",
" 1,\n",
" 2,\n",
" 3\n",
" ],\n",
" [\n",
" 4,\n",
" 5,\n",
" 6\n",
" ]\n",
"]"
]
},
"execution_count": 54,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"pa.chunked_array(chunked)"
]
},
{
"cell_type": "code",
"execution_count": 55,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"<pyarrow.lib.Int32Array object at 0x11b552500>\n",
"[\n",
" 4,\n",
" 5,\n",
" 6\n",
"]"
]
},
"execution_count": 55,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"pa.array(chunked.chunk(1))"
]
},
{
"cell_type": "code",
"execution_count": 56,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"nanoarrow.Array<int64>[3]\n",
"10\n",
"11\n",
"12"
]
},
"execution_count": 56,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"na.Array(pa.array([10, 11, 12]))"
]
},
{
"cell_type": "code",
"execution_count": 57,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"<Schema> string"
]
},
"execution_count": 57,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"na.Schema(pa.string())"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"## Low-level C library bindings\n",
"\n",
"The nanoarrow Python package also provides lower level wrappers around Arrow C interface structures. You can create these using `nanoarrow.c_schema()`, `nanoarrow.c_array()`, and `nanoarrow.c_array_stream()`.\n",
"\n",
"### Schemas\n",
"\n",
"Use `nanoarrow.c_schema()` to convert an object to an `ArrowSchema` and wrap it as a Python object. This works for any object implementing the [Arrow PyCapsule Interface](https://arrow.apache.org/docs/format/CDataInterface.html) (e.g., `pyarrow.Schema`, `pyarrow.DataType`, and `pyarrow.Field`)."
]
},
{
"cell_type": "code",
"execution_count": 58,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"<nanoarrow.c_schema.CSchema decimal128(10, 3)>\n",
"- format: 'd:10,3'\n",
"- name: ''\n",
"- flags: 2\n",
"- metadata: NULL\n",
"- dictionary: NULL\n",
"- children[0]:"
]
},
"execution_count": 58,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"na.c_schema(pa.decimal128(10, 3))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Using `c_schema()` is a good fit for testing and for ephemeral schema objects that are being passed from one library to another. To extract the fields of a schema in a more convenient form, use `Schema()`:"
]
},
{
"cell_type": "code",
"execution_count": 59,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"(10, 3)"
]
},
"execution_count": 59,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"schema = na.Schema(pa.decimal128(10, 3))\n",
"schema.precision, schema.scale"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"The `CSchema` object cleans up after itself: when the object is deleted, the underlying `ArrowSchema` is released."
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"### Arrays\n",
"\n",
"You can use `nanoarrow.c_array()` to convert an array-like object to an `ArrowArray`, wrap it as a Python object, and attach a schema that can be used to interpret its contents. This works for any object implementing the [Arrow PyCapsule Interface](https://arrow.apache.org/docs/format/CDataInterface.html) (e.g., `pyarrow.Array`, `pyarrow.RecordBatch`)."
]
},
{
"cell_type": "code",
"execution_count": 60,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"<nanoarrow.c_array.CArray string>\n",
"- length: 4\n",
"- offset: 0\n",
"- null_count: 1\n",
"- buffers: (4754305168, 4754307808, 4754310464)\n",
"- dictionary: NULL\n",
"- children[0]:"
]
},
"execution_count": 60,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"na.c_array([\"one\", \"two\", \"three\", None], na.string())"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"Using `c_array()` is a good fit for testing and for ephemeral array objects that are being passed from one library to another. For a higher level interface, use `Array()`:"
]
},
{
"cell_type": "code",
"execution_count": 61,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"['one', 'two', 'three', None]"
]
},
"execution_count": 61,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"array = na.Array([\"one\", \"two\", \"three\", None], na.string())\n",
"array.to_pylist()"
]
},
{
"cell_type": "code",
"execution_count": 62,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"(nanoarrow.c_lib.CBufferView(bool[1 b] 11100000),\n",
" nanoarrow.c_lib.CBufferView(int32[20 b] 0 3 6 11 11),\n",
" nanoarrow.c_lib.CBufferView(string[11 b] b'onetwothree'))"
]
},
"execution_count": 62,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"array.buffers"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Advanced users can create arrays directly from buffers using `c_array_from_buffers()`:"
]
},
{
"cell_type": "code",
"execution_count": 63,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"<nanoarrow.c_array.CArray string>\n",
"- length: 2\n",
"- offset: 0\n",
"- null_count: 0\n",
"- buffers: (0, 5002908320, 4999694624)\n",
"- dictionary: NULL\n",
"- children[0]:"
]
},
"execution_count": 63,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"na.c_array_from_buffers(\n",
" na.string(),\n",
" 2,\n",
" [None, na.c_buffer([0, 3, 6], na.int32()), b\"abcdef\"]\n",
")"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"### Array streams\n",
"\n",
"You can use `nanoarrow.c_array_stream()` to wrap an object representing a sequence of `CArray`s with a common `CSchema` to an `ArrowArrayStream` and wrap it as a Python object. This works for any object implementing the [Arrow PyCapsule Interface](https://arrow.apache.org/docs/format/CDataInterface.html) (e.g., `pyarrow.RecordBatchReader`, `pyarrow.ChunkedArray`)."
]
},
{
"cell_type": "code",
"execution_count": 64,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"<nanoarrow.c_array_stream.CArrayStream>\n",
"- get_schema(): struct<col1: int64>"
]
},
"execution_count": 64,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"pa_batch = pa.record_batch({\"col1\": [1, 2, 3]})\n",
"reader = pa.RecordBatchReader.from_batches(pa_batch.schema, [pa_batch])\n",
"array_stream = na.c_array_stream(reader)\n",
"array_stream"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"You can pull the next array from the stream using `.get_next()` or use it like an iterator. The `.get_next()` method will raise `StopIteration` when there are no more arrays in the stream."
]
},
{
"cell_type": "code",
"execution_count": 65,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"<nanoarrow.c_array.CArray struct<col1: int64>>\n",
"- length: 3\n",
"- offset: 0\n",
"- null_count: 0\n",
"- buffers: (0,)\n",
"- dictionary: NULL\n",
"- children[1]:\n",
" 'col1': <nanoarrow.c_array.CArray int64>\n",
" - length: 3\n",
" - offset: 0\n",
" - null_count: 0\n",
" - buffers: (0, 2642948588352)\n",
" - dictionary: NULL\n",
" - children[0]:\n"
]
}
],
"source": [
"for array in array_stream:\n",
" print(array)"
]
},
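{
"cell_type": "markdown",
"metadata": {},
"source": [
"The explicit `.get_next()` loop mentioned above can be sketched as follows. This is a minimal, unexecuted sketch, assuming `pyarrow` is installed; the names `batch`, `c_stream`, and `lengths` are illustrative only, and a fresh reader is built because an array stream can only be consumed once.\n",
"\n",
"```python\n",
"import nanoarrow as na\n",
"import pyarrow as pa\n",
"\n",
"# Build a fresh single-batch stream (assumption: same data as the example above)\n",
"batch = pa.record_batch({\"col1\": [1, 2, 3]})\n",
"reader = pa.RecordBatchReader.from_batches(batch.schema, [batch])\n",
"c_stream = na.c_array_stream(reader)\n",
"\n",
"# Pull arrays one at a time until the stream signals exhaustion\n",
"lengths = []\n",
"while True:\n",
"    try:\n",
"        c_array = c_stream.get_next()\n",
"    except StopIteration:\n",
"        break\n",
"    lengths.append(c_array.length)\n",
"```\n",
"\n",
"Iterating with `for` is usually more convenient; the explicit loop is mainly useful when arrays need to be pulled on demand."
]
},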
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Use `ArrayStream()` for a higher level interface:"
]
},
{
"cell_type": "code",
"execution_count": 66,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"nanoarrow.Array<non-nullable struct<col1: int64>>[3]\n",
"{'col1': 1}\n",
"{'col1': 2}\n",
"{'col1': 3}"
]
},
"execution_count": 66,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"reader = pa.RecordBatchReader.from_batches(pa_batch.schema, [pa_batch])\n",
"na.ArrayStream(reader).read_all()"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"## Development\n",
"\n",
"Python bindings for nanoarrow are managed with [setuptools](https://setuptools.pypa.io/en/latest/index.html).\n",
"This means you can build the project using:\n",
"\n",
"```shell\n",
"git clone https://github.com/apache/arrow-nanoarrow.git\n",
"cd arrow-nanoarrow/python\n",
"pip install -e .\n",
"```\n",
"\n",
"Tests use [pytest](https://docs.pytest.org/):\n",
"\n",
"```shell\n",
"# Install dependencies\n",
"pip install -e \".[test]\"\n",
"\n",
"# Run tests\n",
"pytest -vvx\n",
"```\n",
"\n",
"CMake is currently required to ensure that the vendored copy of nanoarrow in the Python package stays in sync with the nanoarrow sources in the working tree."
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.12.3"
},
"orig_nbformat": 4
},
"nbformat": 4,
"nbformat_minor": 2
}