| { |
| "cells": [ |
| { |
| "attachments": {}, |
| "cell_type": "markdown", |
| "metadata": {}, |
| "source": [ |
| "<!---\n", |
| " Licensed to the Apache Software Foundation (ASF) under one\n", |
| " or more contributor license agreements. See the NOTICE file\n", |
| " distributed with this work for additional information\n", |
| " regarding copyright ownership. The ASF licenses this file\n", |
| " to you under the Apache License, Version 2.0 (the\n", |
| " \"License\"); you may not use this file except in compliance\n", |
| " with the License. You may obtain a copy of the License at\n", |
| "\n", |
| " http://www.apache.org/licenses/LICENSE-2.0\n", |
| "\n", |
| " Unless required by applicable law or agreed to in writing,\n", |
| " software distributed under the License is distributed on an\n", |
| " \"AS IS\" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY\n", |
| " KIND, either express or implied. See the License for the\n", |
| " specific language governing permissions and limitations\n", |
| " under the License.\n", |
| "-->\n", |
| "\n", |
| "<!-- Render with jupyter nbconvert --to markdown README.ipynb -->\n", |
| "\n", |
| "# nanoarrow for Python\n", |
| "\n", |
| "The nanoarrow Python package provides bindings to the nanoarrow C library. Like\n", |
| "the nanoarrow C library, it provides tools to facilitate the use of the\n", |
| "[Arrow C Data](https://arrow.apache.org/docs/format/CDataInterface.html) \n", |
| "and [Arrow C Stream](https://arrow.apache.org/docs/format/CStreamInterface.html) \n", |
| "interfaces.\n", |
| "\n", |
| "## Installation\n", |
| "\n", |
| "Python bindings for nanoarrow are not yet available on PyPI. You can install via\n", |
| "URL (requires a C compiler):\n", |
| "\n", |
| "```bash\n", |
| "python -m pip install \"git+https://github.com/apache/arrow-nanoarrow.git#egg=nanoarrow&subdirectory=python\"\n", |
| "```\n", |
| "\n", |
| "If you can import the namespace, you're good to go!" |
| ] |
| }, |
| { |
| "cell_type": "code", |
| "execution_count": 1, |
| "metadata": {}, |
| "outputs": [], |
| "source": [ |
| "import nanoarrow as na" |
| ] |
| }, |
| { |
| "attachments": {}, |
| "cell_type": "markdown", |
| "metadata": {}, |
| "source": [ |
| "## Low-level C library bindings\n", |
| "\n", |
| "The Arrow C Data and Arrow C Stream interfaces are comprised of three structures: the `ArrowSchema` which represents a data type of an array, the `ArrowArray` which represents the values of an array, and an `ArrowArrayStream`, which represents zero or more `ArrowArray`s with a common `ArrowSchema`.\n", |
| "\n", |
| "### Schemas\n", |
| "\n", |
| "Use `nanoarrow.c_schema()` to convert an object to an `ArrowSchema` and wrap it as a Python object. This works for any object implementing the [Arrow PyCapsule Interface](https://arrow.apache.org/docs/format/CDataInterface.html) (e.g., `pyarrow.Schema`, `pyarrow.DataType`, and `pyarrow.Field`)." |
| ] |
| }, |
| { |
| "cell_type": "code", |
| "execution_count": 2, |
| "metadata": {}, |
| "outputs": [ |
| { |
| "data": { |
| "text/plain": [ |
| "<nanoarrow.c_lib.CSchema decimal128(10, 3)>\n", |
| "- format: 'd:10,3'\n", |
| "- name: ''\n", |
| "- flags: 2\n", |
| "- metadata: NULL\n", |
| "- dictionary: NULL\n", |
| "- children[0]:" |
| ] |
| }, |
| "execution_count": 2, |
| "metadata": {}, |
| "output_type": "execute_result" |
| } |
| ], |
| "source": [ |
| "import pyarrow as pa\n", |
| "schema = na.c_schema(pa.decimal128(10, 3))\n", |
| "schema" |
| ] |
| }, |
| { |
| "attachments": {}, |
| "cell_type": "markdown", |
| "metadata": {}, |
| "source": [ |
| "You can extract the fields of a `CSchema` object one at a time or parse it into a view to extract deserialized parameters." |
| ] |
| }, |
| { |
| "cell_type": "code", |
| "execution_count": 3, |
| "metadata": {}, |
| "outputs": [ |
| { |
| "data": { |
| "text/plain": [ |
| "<nanoarrow.c_lib.CSchemaView>\n", |
| "- type: 'decimal128'\n", |
| "- storage_type: 'decimal128'\n", |
| "- decimal_bitwidth: 128\n", |
| "- decimal_precision: 10\n", |
| "- decimal_scale: 3\n", |
| "- dictionary_ordered: False\n", |
| "- map_keys_sorted: False\n", |
| "- nullable: True\n", |
| "- storage_type_id: 24\n", |
| "- type_id: 24" |
| ] |
| }, |
| "execution_count": 3, |
| "metadata": {}, |
| "output_type": "execute_result" |
| } |
| ], |
| "source": [ |
| "na.c_schema_view(schema)" |
| ] |
| }, |
| { |
| "cell_type": "markdown", |
| "metadata": {}, |
| "source": [ |
| "Advanced users can allocate an empty `CSchema` and populate its contents by passing its `._addr()` to a schema-exporting function." |
| ] |
| }, |
| { |
| "cell_type": "code", |
| "execution_count": 4, |
| "metadata": {}, |
| "outputs": [ |
| { |
| "data": { |
| "text/plain": [ |
| "<nanoarrow.c_lib.CSchema int32>\n", |
| "- format: 'i'\n", |
| "- name: ''\n", |
| "- flags: 2\n", |
| "- metadata: NULL\n", |
| "- dictionary: NULL\n", |
| "- children[0]:" |
| ] |
| }, |
| "execution_count": 4, |
| "metadata": {}, |
| "output_type": "execute_result" |
| } |
| ], |
| "source": [ |
| "schema = na.allocate_c_schema()\n", |
| "pa.int32()._export_to_c(schema._addr())\n", |
| "schema" |
| ] |
| }, |
| { |
| "attachments": {}, |
| "cell_type": "markdown", |
| "metadata": {}, |
| "source": [ |
| "The `CSchema` object cleans up after itself: when the object is deleted, the underlying `ArrowSchema` is released." |
| ] |
| }, |
| { |
| "attachments": {}, |
| "cell_type": "markdown", |
| "metadata": {}, |
| "source": [ |
| "### Arrays\n", |
| "\n", |
| "You can use `nanoarrow.c_array()` to convert an array-like object to an `ArrowArray`, wrap it as a Python object, and attach a schema that can be used to interpret its contents. This works for any object implementing the [Arrow PyCapsule Interface](https://arrow.apache.org/docs/format/CDataInterface.html) (e.g., `pyarrow.Array`, `pyarrow.RecordBatch`)." |
| ] |
| }, |
| { |
| "cell_type": "code", |
| "execution_count": 5, |
| "metadata": {}, |
| "outputs": [ |
| { |
| "data": { |
| "text/plain": [ |
| "<nanoarrow.c_lib.CArray string>\n", |
| "- length: 4\n", |
| "- offset: 0\n", |
| "- null_count: 1\n", |
| "- buffers: (3678035706048, 3678035705984, 3678035706112)\n", |
| "- dictionary: NULL\n", |
| "- children[0]:" |
| ] |
| }, |
| "execution_count": 5, |
| "metadata": {}, |
| "output_type": "execute_result" |
| } |
| ], |
| "source": [ |
| "array = na.c_array(pa.array([\"one\", \"two\", \"three\", None]))\n", |
| "array" |
| ] |
| }, |
| { |
| "attachments": {}, |
| "cell_type": "markdown", |
| "metadata": {}, |
| "source": [ |
| "You can extract the fields of a `CArray` one at a time or parse it into a view to extract deserialized content:" |
| ] |
| }, |
| { |
| "cell_type": "code", |
| "execution_count": 6, |
| "metadata": {}, |
| "outputs": [ |
| { |
| "data": { |
| "text/plain": [ |
| "<nanoarrow.c_lib.CArrayView>\n", |
| "- storage_type: 'string'\n", |
| "- length: 4\n", |
| "- offset: 0\n", |
| "- null_count: 1\n", |
| "- buffers[3]:\n", |
| " - validity <bool[1 b] 11100000>\n", |
| " - data_offset <int32[20 b] 0 3 6 11 11>\n", |
| " - data <string[11 b] b'onetwothree'>\n", |
| "- dictionary: NULL\n", |
| "- children[0]:" |
| ] |
| }, |
| "execution_count": 6, |
| "metadata": {}, |
| "output_type": "execute_result" |
| } |
| ], |
| "source": [ |
| "na.c_array_view(array)" |
| ] |
| }, |
| { |
| "attachments": {}, |
| "cell_type": "markdown", |
| "metadata": {}, |
| "source": [ |
| "Like the `CSchema`, you can allocate an empty one and access its address with `_addr()` to pass to other array-exporting functions." |
| ] |
| }, |
| { |
| "cell_type": "code", |
| "execution_count": 7, |
| "metadata": {}, |
| "outputs": [ |
| { |
| "data": { |
| "text/plain": [ |
| "3" |
| ] |
| }, |
| "execution_count": 7, |
| "metadata": {}, |
| "output_type": "execute_result" |
| } |
| ], |
| "source": [ |
| "array = na.allocate_c_array()\n", |
| "pa.array([1, 2, 3])._export_to_c(array._addr(), array.schema._addr())\n", |
| "array.length" |
| ] |
| }, |
| { |
| "attachments": {}, |
| "cell_type": "markdown", |
| "metadata": {}, |
| "source": [ |
| "### Array streams\n", |
| "\n", |
| "You can use `nanoarrow.c_array_stream()` to wrap an object representing a sequence of `CArray`s with a common `CSchema` to an `ArrowArrayStream` and wrap it as a Python object. This works for any object implementing the [Arrow PyCapsule Interface](https://arrow.apache.org/docs/format/CDataInterface.html) (e.g., `pyarrow.RecordBatchReader`)." |
| ] |
| }, |
| { |
| "cell_type": "code", |
| "execution_count": 8, |
| "metadata": {}, |
| "outputs": [ |
| { |
| "data": { |
| "text/plain": [ |
| "<nanoarrow.c_lib.CArrayStream>\n", |
| "- get_schema(): struct<some_column: int32>" |
| ] |
| }, |
| "execution_count": 8, |
| "metadata": {}, |
| "output_type": "execute_result" |
| } |
| ], |
| "source": [ |
| "pa_array_child = pa.array([1, 2, 3], pa.int32())\n", |
| "pa_array = pa.record_batch([pa_array_child], names=[\"some_column\"])\n", |
| "reader = pa.RecordBatchReader.from_batches(pa_array.schema, [pa_array])\n", |
| "array_stream = na.c_array_stream(reader)\n", |
| "array_stream" |
| ] |
| }, |
| { |
| "attachments": {}, |
| "cell_type": "markdown", |
| "metadata": {}, |
| "source": [ |
| "You can pull the next array from the stream using `.get_next()` or use it like an iterator. The `.get_next()` method will raise `StopIteration` when there are no more arrays in the stream." |
| ] |
| }, |
| { |
| "cell_type": "code", |
| "execution_count": 9, |
| "metadata": {}, |
| "outputs": [ |
| { |
| "name": "stdout", |
| "output_type": "stream", |
| "text": [ |
| "<nanoarrow.c_lib.CArray struct<some_column: int32>>\n", |
| "- length: 3\n", |
| "- offset: 0\n", |
| "- null_count: 0\n", |
| "- buffers: (0,)\n", |
| "- dictionary: NULL\n", |
| "- children[1]:\n", |
| " 'some_column': <nanoarrow.c_lib.CArray int32>\n", |
| " - length: 3\n", |
| " - offset: 0\n", |
| " - null_count: 0\n", |
| " - buffers: (0, 3678035837056)\n", |
| " - dictionary: NULL\n", |
| " - children[0]:\n" |
| ] |
| } |
| ], |
| "source": [ |
| "for array in array_stream:\n", |
| " print(array)" |
| ] |
| }, |
| { |
| "attachments": {}, |
| "cell_type": "markdown", |
| "metadata": {}, |
| "source": [ |
| "You can also get the address of a freshly-allocated stream to pass to a suitable exporting function:" |
| ] |
| }, |
| { |
| "cell_type": "code", |
| "execution_count": 10, |
| "metadata": {}, |
| "outputs": [ |
| { |
| "data": { |
| "text/plain": [ |
| "<nanoarrow.c_lib.CArrayStream>\n", |
| "- get_schema(): struct<some_column: int32>" |
| ] |
| }, |
| "execution_count": 10, |
| "metadata": {}, |
| "output_type": "execute_result" |
| } |
| ], |
| "source": [ |
| "array_stream = na.allocate_c_array_stream()\n", |
| "reader._export_to_c(array_stream._addr())\n", |
| "array_stream" |
| ] |
| }, |
| { |
| "attachments": {}, |
| "cell_type": "markdown", |
| "metadata": {}, |
| "source": [ |
| "## Development\n", |
| "\n", |
| "Python bindings for nanoarrow are managed with [setuptools](https://setuptools.pypa.io/en/latest/index.html).\n", |
| "This means you can build the project using:\n", |
| "\n", |
| "```shell\n", |
| "git clone https://github.com/apache/arrow-nanoarrow.git\n", |
| "cd arrow-nanoarrow/python\n", |
| "pip install -e .\n", |
| "```\n", |
| "\n", |
| "Tests use [pytest](https://docs.pytest.org/):\n", |
| "\n", |
| "```shell\n", |
| "# Install dependencies\n", |
| "pip install -e .[test]\n", |
| "\n", |
| "# Run tests\n", |
| "pytest -vvx\n", |
| "```" |
| ] |
| } |
| ], |
| "metadata": { |
| "kernelspec": { |
| "display_name": "Python 3", |
| "language": "python", |
| "name": "python3" |
| }, |
| "language_info": { |
| "codemirror_mode": { |
| "name": "ipython", |
| "version": 3 |
| }, |
| "file_extension": ".py", |
| "mimetype": "text/x-python", |
| "name": "python", |
| "nbconvert_exporter": "python", |
| "pygments_lexer": "ipython3", |
| "version": "3.11.4" |
| }, |
| "orig_nbformat": 4 |
| }, |
| "nbformat": 4, |
| "nbformat_minor": 2 |
| } |