{
"cells": [
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"<!---\n",
" Licensed to the Apache Software Foundation (ASF) under one\n",
" or more contributor license agreements. See the NOTICE file\n",
" distributed with this work for additional information\n",
" regarding copyright ownership. The ASF licenses this file\n",
" to you under the Apache License, Version 2.0 (the\n",
" \"License\"); you may not use this file except in compliance\n",
" with the License. You may obtain a copy of the License at\n",
"\n",
" http://www.apache.org/licenses/LICENSE-2.0\n",
"\n",
" Unless required by applicable law or agreed to in writing,\n",
" software distributed under the License is distributed on an\n",
" \"AS IS\" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY\n",
" KIND, either express or implied. See the License for the\n",
" specific language governing permissions and limitations\n",
" under the License.\n",
"-->\n",
"\n",
"<!-- Render with jupyter nbconvert --to markdown README.ipynb -->\n",
"\n",
"# nanoarrow for Python\n",
"\n",
"The nanoarrow Python package provides bindings to the nanoarrow C library. Like\n",
"the nanoarrow C library, it provides tools to facilitate the use of the\n",
"[Arrow C Data](https://arrow.apache.org/docs/format/CDataInterface.html) \n",
"and [Arrow C Stream](https://arrow.apache.org/docs/format/CStreamInterface.html) \n",
"interfaces.\n",
"\n",
"## Installation\n",
"\n",
"Python bindings for nanoarrow are not yet available on PyPI. You can install via\n",
"URL (requires a C compiler):\n",
"\n",
"```bash\n",
"python -m pip install \"https://github.com/apache/arrow-nanoarrow/archive/refs/heads/main.zip#egg=nanoarrow&subdirectory=python\"\n",
"```\n",
"\n",
"If you can import the namespace, you're good to go!"
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [],
"source": [
"import nanoarrow as na"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"## Example\n",
"\n",
"The Arrow C Data and Arrow C Stream interfaces comprise three structures: the `ArrowSchema`, which represents the data type of an array; the `ArrowArray`, which represents the values of an array; and the `ArrowArrayStream`, which represents zero or more `ArrowArray`s with a common `ArrowSchema`. All three can be wrapped by Python objects using the nanoarrow Python package.\n",
"\n",
"### Schemas\n",
"\n",
"Use `nanoarrow.schema()` to convert a data type-like object to an `ArrowSchema`. This is currently only implemented for pyarrow objects."
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [],
"source": [
"import pyarrow as pa\n",
"schema = na.schema(pa.decimal128(10, 3))"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"You can extract the fields of a `Schema` object one at a time or parse it into a view to extract deserialized parameters."
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"d:10,3\n",
"10\n",
"3\n"
]
}
],
"source": [
"print(schema.format)\n",
"print(schema.view().decimal_precision)\n",
"print(schema.view().decimal_scale)"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"The `nanoarrow.schema()` helper is currently only implemented for pyarrow objects. If your data type has an `_export_to_c()`-like function, you can instead allocate an empty `Schema` and pass its address to that function:"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"'int32'"
]
},
"execution_count": 4,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"schema = na.Schema.allocate()\n",
"pa.int32()._export_to_c(schema._addr())\n",
"schema.view().type"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"The `Schema` object cleans up after itself: when the object is deleted, the underlying `ArrowSchema` is released."
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"### Arrays\n",
"\n",
"You can use `nanoarrow.array()` to convert an array-like object to a `nanoarrow.Array`, optionally attaching a `Schema` that can be used to interpret its contents. This is currently only implemented for pyarrow objects."
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [],
"source": [
"array = na.array(pa.array([\"one\", \"two\", \"three\", None]))"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"As with the `Schema`, you can inspect an `Array` by extracting fields individually:"
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"4\n",
"1\n"
]
}
],
"source": [
"print(array.length)\n",
"print(array.null_count)"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"...and parse the `Array`/`Schema` combination into a view whose contents are more readily accessible."
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"[array([7], dtype=uint8),\n",
" array([ 0, 3, 6, 11, 11], dtype=int32),\n",
" array([b'o', b'n', b'e', b't', b'w', b'o', b't', b'h', b'r', b'e', b'e'],\n",
" dtype='|S1')]"
]
},
"execution_count": 7,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"import numpy as np\n",
"view = array.view()\n",
"[np.array(buffer) for buffer in view.buffers]"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"As with the `Schema`, you can allocate an empty `Array` and access its address with `_addr()` to pass to other array-exporting functions."
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"3"
]
},
"execution_count": 8,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"array = na.Array.allocate(na.Schema.allocate())\n",
"pa.array([1, 2, 3])._export_to_c(array._addr(), array.schema._addr())\n",
"array.length"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"### Array streams\n",
"\n",
"You can use `nanoarrow.array_stream()` to convert an object representing a sequence of `Array`s with a common `Schema` to a `nanoarrow.ArrayStream`. This is currently only implemented for pyarrow objects."
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {},
"outputs": [],
"source": [
"pa_array_child = pa.array([1, 2, 3], pa.int32())\n",
"pa_array = pa.record_batch([pa_array_child], names=[\"some_column\"])\n",
"reader = pa.RecordBatchReader.from_batches(pa_array.schema, [pa_array])\n",
"array_stream = na.array_stream(reader)"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"You can pull the next array from the stream using `.get_next()` or use it like an iterator. The `.get_next()` method will return `None` when there are no more arrays in the stream."
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"struct<some_column: int32>\n",
"3\n",
"True\n"
]
}
],
"source": [
"print(array_stream.get_schema())\n",
"\n",
"for array in array_stream:\n",
" print(array.length)\n",
"\n",
"print(array_stream.get_next() is None)"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"You can also get the address of a freshly-allocated stream to pass to a suitable exporting function:"
]
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"struct<some_column: int32>"
]
},
"execution_count": 11,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"array_stream = na.ArrayStream.allocate()\n",
"reader._export_to_c(array_stream._addr())\n",
"array_stream.get_schema()"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"## Development\n",
"\n",
"Python bindings for nanoarrow are managed with [setuptools](https://setuptools.pypa.io/en/latest/index.html).\n",
"This means you can build the project using:\n",
"\n",
"```shell\n",
"git clone https://github.com/apache/arrow-nanoarrow.git\n",
"cd arrow-nanoarrow/python\n",
"pip install -e .\n",
"```\n",
"\n",
"Tests use [pytest](https://docs.pytest.org/):\n",
"\n",
"```shell\n",
"# Install dependencies\n",
"pip install -e .[test]\n",
"\n",
"# Run tests\n",
"pytest -vvx\n",
"```"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.9.6"
},
"orig_nbformat": 4
},
"nbformat": 4,
"nbformat_minor": 2
}