{
"cells": [
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"<!---\n",
" Licensed to the Apache Software Foundation (ASF) under one\n",
" or more contributor license agreements. See the NOTICE file\n",
" distributed with this work for additional information\n",
" regarding copyright ownership. The ASF licenses this file\n",
" to you under the Apache License, Version 2.0 (the\n",
" \"License\"); you may not use this file except in compliance\n",
" with the License. You may obtain a copy of the License at\n",
"\n",
" http://www.apache.org/licenses/LICENSE-2.0\n",
"\n",
" Unless required by applicable law or agreed to in writing,\n",
" software distributed under the License is distributed on an\n",
" \"AS IS\" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY\n",
" KIND, either express or implied. See the License for the\n",
" specific language governing permissions and limitations\n",
" under the License.\n",
"-->\n",
"\n",
"<!-- Render with jupyter nbconvert --to markdown README.ipynb -->\n",
"\n",
"# nanoarrow for Python\n",
"\n",
"The nanoarrow Python package provides bindings to the nanoarrow C library. Like\n",
"the nanoarrow C library, it provides tools to facilitate the use of the\n",
"[Arrow C Data](https://arrow.apache.org/docs/format/CDataInterface.html) \n",
"and [Arrow C Stream](https://arrow.apache.org/docs/format/CStreamInterface.html) \n",
"interfaces.\n",
"\n",
"## Installation\n",
"\n",
"Python bindings for nanoarrow are not yet available on PyPI. You can install via\n",
"URL (requires a C compiler):\n",
"\n",
"```bash\n",
"python -m pip install \"git+https://github.com/apache/arrow-nanoarrow.git#egg=nanoarrow&subdirectory=python\"\n",
"```\n",
"\n",
"If you can import the namespace, you're good to go!"
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [],
"source": [
"import nanoarrow as na"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"## Low-level C library bindings\n",
"\n",
"The Arrow C Data and Arrow C Stream interfaces comprise three structures: the `ArrowSchema`, which represents the data type of an array; the `ArrowArray`, which represents the values of an array; and the `ArrowArrayStream`, which represents zero or more `ArrowArray`s with a common `ArrowSchema`.\n",
"\n",
"### Schemas\n",
"\n",
"Use `nanoarrow.c_schema()` to convert an object to an `ArrowSchema` and wrap it as a Python object. This works for any object implementing the [Arrow PyCapsule Interface](https://arrow.apache.org/docs/format/CDataInterface/PyCapsuleInterface.html) (e.g., `pyarrow.Schema`, `pyarrow.DataType`, and `pyarrow.Field`)."
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"<nanoarrow.c_lib.CSchema decimal128(10, 3)>\n",
"- format: 'd:10,3'\n",
"- name: ''\n",
"- flags: 2\n",
"- metadata: NULL\n",
"- dictionary: NULL\n",
"- children[0]:"
]
},
"execution_count": 2,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"import pyarrow as pa\n",
"schema = na.c_schema(pa.decimal128(10, 3))\n",
"schema"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"You can extract the fields of a `CSchema` object one at a time or parse it into a view to extract deserialized parameters."
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"<nanoarrow.c_lib.CSchemaView>\n",
"- type: 'decimal128'\n",
"- storage_type: 'decimal128'\n",
"- decimal_bitwidth: 128\n",
"- decimal_precision: 10\n",
"- decimal_scale: 3\n",
"- dictionary_ordered: False\n",
"- map_keys_sorted: False\n",
"- nullable: True\n",
"- storage_type_id: 24\n",
"- type_id: 24"
]
},
"execution_count": 3,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"na.c_schema_view(schema)"
]
},
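{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a rough illustration of the kind of parsing a schema view performs, the sketch below splits a decimal format string such as `'d:10,3'` in plain Python. The `parse_decimal_format()` helper is hypothetical and not part of nanoarrow; it only assumes the `d:precision,scale[,bitwidth]` layout from the Arrow C Data interface, with the bit width taken to be 128 when omitted.\n",
"\n",
"```python\n",
"def parse_decimal_format(fmt):\n",
"    # Hypothetical helper (not a nanoarrow API): parse 'd:precision,scale[,bitwidth]'\n",
"    prefix, _, params = fmt.partition(':')\n",
"    if prefix != 'd' or not params:\n",
"        raise ValueError(f'not a decimal format string: {fmt!r}')\n",
"\n",
"    parts = [int(p) for p in params.split(',')]\n",
"    if len(parts) not in (2, 3):\n",
"        raise ValueError(f'expected 2 or 3 decimal parameters: {fmt!r}')\n",
"\n",
"    # Assumption: a missing third parameter means a 128-bit decimal\n",
"    bitwidth = parts[2] if len(parts) == 3 else 128\n",
"    return {'precision': parts[0], 'scale': parts[1], 'bitwidth': bitwidth}\n",
"\n",
"\n",
"print(parse_decimal_format('d:10,3'))\n",
"```\n",
"\n",
"For the schema above, this prints `{'precision': 10, 'scale': 3, 'bitwidth': 128}`, matching the `decimal_*` fields of the view."
]
},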
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Advanced users can allocate an empty `CSchema` and populate its contents by passing its `._addr()` to a schema-exporting function."
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"<nanoarrow.c_lib.CSchema int32>\n",
"- format: 'i'\n",
"- name: ''\n",
"- flags: 2\n",
"- metadata: NULL\n",
"- dictionary: NULL\n",
"- children[0]:"
]
},
"execution_count": 4,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"schema = na.allocate_c_schema()\n",
"pa.int32()._export_to_c(schema._addr())\n",
"schema"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"The `CSchema` object cleans up after itself: when the object is deleted, the underlying `ArrowSchema` is released."
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"### Arrays\n",
"\n",
"You can use `nanoarrow.c_array()` to convert an array-like object to an `ArrowArray`, wrap it as a Python object, and attach a schema that can be used to interpret its contents. This works for any object implementing the [Arrow PyCapsule Interface](https://arrow.apache.org/docs/format/CDataInterface/PyCapsuleInterface.html) (e.g., `pyarrow.Array`, `pyarrow.RecordBatch`)."
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"<nanoarrow.c_lib.CArray string>\n",
"- length: 4\n",
"- offset: 0\n",
"- null_count: 1\n",
"- buffers: (3678035706048, 3678035705984, 3678035706112)\n",
"- dictionary: NULL\n",
"- children[0]:"
]
},
"execution_count": 5,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"array = na.c_array(pa.array([\"one\", \"two\", \"three\", None]))\n",
"array"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"You can extract the fields of a `CArray` one at a time or parse it into a view to extract deserialized content:"
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"<nanoarrow.c_lib.CArrayView>\n",
"- storage_type: 'string'\n",
"- length: 4\n",
"- offset: 0\n",
"- null_count: 1\n",
"- buffers[3]:\n",
" - validity <bool[1 b] 11100000>\n",
" - data_offset <int32[20 b] 0 3 6 11 11>\n",
" - data <string[11 b] b'onetwothree'>\n",
"- dictionary: NULL\n",
"- children[0]:"
]
},
"execution_count": 6,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"na.c_array_view(array)"
]
},
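{
"cell_type": "markdown",
"metadata": {},
"source": [
"To make the three buffers in the view above concrete, here is a pure-Python sketch of how a validity bitmap, an offsets buffer, and a data buffer combine into values. The `decode_string_array()` helper is hypothetical (it is not a nanoarrow function), the inputs are copied by hand from the output above, and it assumes a validity bitmap is present. Validity bits are read least-significant bit first, as in the Arrow columnar format; the view appears to print them in element order.\n",
"\n",
"```python\n",
"def decode_string_array(validity, offsets, data, length):\n",
"    # Hypothetical helper, not a nanoarrow API\n",
"    values = []\n",
"    for i in range(length):\n",
"        # Bit i lives in byte i // 8, at position i % 8 (least-significant first)\n",
"        is_valid = (validity[i // 8] >> (i % 8)) & 1\n",
"        if is_valid:\n",
"            # Element i spans data[offsets[i]:offsets[i + 1]]\n",
"            values.append(data[offsets[i]:offsets[i + 1]].decode('utf-8'))\n",
"        else:\n",
"            values.append(None)\n",
"    return values\n",
"\n",
"\n",
"validity = bytes([0b00000111])  # elements 0, 1, and 2 are valid; element 3 is null\n",
"offsets = [0, 3, 6, 11, 11]     # int32 offsets into the data buffer\n",
"data = b'onetwothree'\n",
"\n",
"print(decode_string_array(validity, offsets, data, length=4))\n",
"```\n",
"\n",
"This prints `['one', 'two', 'three', None]`, the values of the original array."
]
},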
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"As with the `CSchema`, you can allocate an empty `CArray` and pass its address (obtained with `._addr()`) to an array-exporting function."
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"3"
]
},
"execution_count": 7,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"array = na.allocate_c_array()\n",
"pa.array([1, 2, 3])._export_to_c(array._addr(), array.schema._addr())\n",
"array.length"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"### Array streams\n",
"\n",
"You can use `nanoarrow.c_array_stream()` to convert an object representing a sequence of `CArray`s with a common `CSchema` to an `ArrowArrayStream` and wrap it as a Python object. This works for any object implementing the [Arrow PyCapsule Interface](https://arrow.apache.org/docs/format/CDataInterface/PyCapsuleInterface.html) (e.g., `pyarrow.RecordBatchReader`)."
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"<nanoarrow.c_lib.CArrayStream>\n",
"- get_schema(): struct<some_column: int32>"
]
},
"execution_count": 8,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"pa_array_child = pa.array([1, 2, 3], pa.int32())\n",
"pa_array = pa.record_batch([pa_array_child], names=[\"some_column\"])\n",
"reader = pa.RecordBatchReader.from_batches(pa_array.schema, [pa_array])\n",
"array_stream = na.c_array_stream(reader)\n",
"array_stream"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"You can pull the next array from the stream using `.get_next()` or use it like an iterator. The `.get_next()` method will raise `StopIteration` when there are no more arrays in the stream."
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"<nanoarrow.c_lib.CArray struct<some_column: int32>>\n",
"- length: 3\n",
"- offset: 0\n",
"- null_count: 0\n",
"- buffers: (0,)\n",
"- dictionary: NULL\n",
"- children[1]:\n",
" 'some_column': <nanoarrow.c_lib.CArray int32>\n",
" - length: 3\n",
" - offset: 0\n",
" - null_count: 0\n",
" - buffers: (0, 3678035837056)\n",
" - dictionary: NULL\n",
" - children[0]:\n"
]
}
],
"source": [
"for array in array_stream:\n",
" print(array)"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"You can also get the address of a freshly allocated stream to pass to a suitable exporting function:"
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"<nanoarrow.c_lib.CArrayStream>\n",
"- get_schema(): struct<some_column: int32>"
]
},
"execution_count": 10,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"array_stream = na.allocate_c_array_stream()\n",
"reader._export_to_c(array_stream._addr())\n",
"array_stream"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"## Development\n",
"\n",
"Python bindings for nanoarrow are managed with [setuptools](https://setuptools.pypa.io/en/latest/index.html).\n",
"This means you can build the project using:\n",
"\n",
"```shell\n",
"git clone https://github.com/apache/arrow-nanoarrow.git\n",
"cd arrow-nanoarrow/python\n",
"pip install -e .\n",
"```\n",
"\n",
"Tests use [pytest](https://docs.pytest.org/):\n",
"\n",
"```shell\n",
"# Install dependencies\n",
"pip install -e \".[test]\"\n",
"\n",
"# Run tests\n",
"pytest -vvx\n",
"```"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.11.4"
},
"orig_nbformat": 4
},
"nbformat": 4,
"nbformat_minor": 2
}