| { |
| "cells": [ |
| { |
| "attachments": {}, |
| "cell_type": "markdown", |
| "metadata": {}, |
| "source": [ |
| "<!---\n", |
| " Licensed to the Apache Software Foundation (ASF) under one\n", |
| " or more contributor license agreements. See the NOTICE file\n", |
| " distributed with this work for additional information\n", |
| " regarding copyright ownership. The ASF licenses this file\n", |
| " to you under the Apache License, Version 2.0 (the\n", |
| " \"License\"); you may not use this file except in compliance\n", |
| " with the License. You may obtain a copy of the License at\n", |
| "\n", |
| " http://www.apache.org/licenses/LICENSE-2.0\n", |
| "\n", |
| " Unless required by applicable law or agreed to in writing,\n", |
| " software distributed under the License is distributed on an\n", |
| " \"AS IS\" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY\n", |
| " KIND, either express or implied. See the License for the\n", |
| " specific language governing permissions and limitations\n", |
| " under the License.\n", |
| "-->\n", |
| "\n", |
| "<!-- Render with jupyter nbconvert --to markdown README.ipynb -->\n", |
| "\n", |
| "# nanoarrow for Python\n", |
| "\n", |
| "The nanoarrow Python package provides bindings to the nanoarrow C library. Like\n", |
| "the nanoarrow C library, it provides tools to facilitate the use of the\n", |
| "[Arrow C Data](https://arrow.apache.org/docs/format/CDataInterface.html) \n", |
| "and [Arrow C Stream](https://arrow.apache.org/docs/format/CStreamInterface.html) \n", |
| "interfaces.\n", |
| "\n", |
| "## Installation\n", |
| "\n", |
| "The nanoarrow Python bindings are available from [PyPI](https://pypi.org/) and\n", |
| "[conda-forge](https://conda-forge.org/):\n", |
| "\n", |
| "```shell\n", |
| "pip install nanoarrow\n", |
| "conda install nanoarrow -c conda-forge\n", |
| "```\n", |
| "\n", |
| "Development versions (based on the `main` branch) are also available:\n", |
| "\n", |
| "```shell\n", |
| "pip install --extra-index-url https://pypi.fury.io/arrow-nightlies/ \\\n", |
| " --prefer-binary --pre nanoarrow\n", |
| "```\n", |
| "\n", |
| "If you can import the namespace, you're good to go!" |
| ] |
| }, |
| { |
| "cell_type": "code", |
| "execution_count": 46, |
| "metadata": {}, |
| "outputs": [], |
| "source": [ |
| "import nanoarrow as na" |
| ] |
| }, |
| { |
| "cell_type": "markdown", |
| "metadata": {}, |
| "source": [ |
| "## Data types, arrays, and array streams\n", |
| "\n", |
| "The Arrow C Data and Arrow C Stream interfaces are comprised of three structures: the `ArrowSchema` which represents a data type of an array, the `ArrowArray` which represents the values of an array, and an `ArrowArrayStream`, which represents zero or more `ArrowArray`s with a common `ArrowSchema`. These concepts map to the `nanoarrow.Schema`, `nanoarrow.Array`, and `nanoarrow.ArrayStream` in the Python package." |
| ] |
| }, |
| { |
| "cell_type": "code", |
| "execution_count": 47, |
| "metadata": {}, |
| "outputs": [ |
| { |
| "data": { |
| "text/plain": [ |
| "<Schema> int32" |
| ] |
| }, |
| "execution_count": 47, |
| "metadata": {}, |
| "output_type": "execute_result" |
| } |
| ], |
| "source": [ |
| "na.int32()" |
| ] |
| }, |
| { |
| "cell_type": "code", |
| "execution_count": 48, |
| "metadata": {}, |
| "outputs": [ |
| { |
| "data": { |
| "text/plain": [ |
| "nanoarrow.Array<int32>[3]\n", |
| "1\n", |
| "2\n", |
| "3" |
| ] |
| }, |
| "execution_count": 48, |
| "metadata": {}, |
| "output_type": "execute_result" |
| } |
| ], |
| "source": [ |
| "na.Array([1, 2, 3], na.int32())" |
| ] |
| }, |
| { |
| "cell_type": "markdown", |
| "metadata": {}, |
| "source": [ |
| "The `nanoarrow.Array` can accommodate arrays with any number of chunks, reflecting the reality that many array containers (e.g., `pyarrow.ChunkedArray`, `polars.Series`) support this." |
| ] |
| }, |
| { |
| "cell_type": "code", |
| "execution_count": 49, |
| "metadata": {}, |
| "outputs": [ |
| { |
| "data": { |
| "text/plain": [ |
| "nanoarrow.Array<int32>[6]\n", |
| "1\n", |
| "2\n", |
| "3\n", |
| "4\n", |
| "5\n", |
| "6" |
| ] |
| }, |
| "execution_count": 49, |
| "metadata": {}, |
| "output_type": "execute_result" |
| } |
| ], |
| "source": [ |
| "chunked = na.Array.from_chunks([[1, 2, 3], [4, 5, 6]], na.int32())\n", |
| "chunked" |
| ] |
| }, |
| { |
| "cell_type": "markdown", |
| "metadata": {}, |
| "source": [ |
| "Whereas chunks of an `Array` are always fully materialized when the object is constructed, the chunks of an `ArrayStream` have not necessarily been resolved yet." |
| ] |
| }, |
| { |
| "cell_type": "code", |
| "execution_count": 50, |
| "metadata": {}, |
| "outputs": [ |
| { |
| "data": { |
| "text/plain": [ |
| "nanoarrow.ArrayStream<int32>" |
| ] |
| }, |
| "execution_count": 50, |
| "metadata": {}, |
| "output_type": "execute_result" |
| } |
| ], |
| "source": [ |
| "stream = na.ArrayStream(chunked)\n", |
| "stream" |
| ] |
| }, |
| { |
| "cell_type": "code", |
| "execution_count": 51, |
| "metadata": {}, |
| "outputs": [ |
| { |
| "name": "stdout", |
| "output_type": "stream", |
| "text": [ |
| "nanoarrow.Array<int32>[3]\n", |
| "1\n", |
| "2\n", |
| "3\n", |
| "nanoarrow.Array<int32>[3]\n", |
| "4\n", |
| "5\n", |
| "6\n" |
| ] |
| } |
| ], |
| "source": [ |
| "with stream:\n", |
| " for chunk in stream:\n", |
| " print(chunk)" |
| ] |
| }, |
| { |
| "cell_type": "markdown", |
| "metadata": {}, |
| "source": [ |
| "The `nanoarrow.ArrayStream` also provides an interface to nanoarrow's [Arrow IPC](https://arrow.apache.org/docs/format/Columnar.html#serialization-and-interprocess-communication-ipc) reader:" |
| ] |
| }, |
| { |
| "cell_type": "code", |
| "execution_count": 52, |
| "metadata": {}, |
| "outputs": [ |
| { |
| "data": { |
| "text/plain": [ |
| "nanoarrow.ArrayStream<non-nullable struct<commit: string, time: timestamp('us', 'UTC'), files: int3...>" |
| ] |
| }, |
| "execution_count": 52, |
| "metadata": {}, |
| "output_type": "execute_result" |
| } |
| ], |
| "source": [ |
| "url = \"https://github.com/apache/arrow-experiments/raw/main/data/arrow-commits/arrow-commits.arrows\"\n", |
| "na.ArrayStream.from_url(url)" |
| ] |
| }, |
| { |
| "cell_type": "markdown", |
| "metadata": {}, |
| "source": [ |
| "These objects implement the [Arrow PyCapsule interface](https://arrow.apache.org/docs/format/CDataInterface/PyCapsuleInterface.html) for both producing and consuming and are interchangeable with `pyarrow` objects in many cases:" |
| ] |
| }, |
| { |
| "cell_type": "code", |
| "execution_count": 53, |
| "metadata": {}, |
| "outputs": [ |
| { |
| "data": { |
| "text/plain": [ |
| "pyarrow.Field<: int32>" |
| ] |
| }, |
| "execution_count": 53, |
| "metadata": {}, |
| "output_type": "execute_result" |
| } |
| ], |
| "source": [ |
| "import pyarrow as pa\n", |
| "\n", |
| "pa.field(na.int32())" |
| ] |
| }, |
| { |
| "cell_type": "code", |
| "execution_count": 54, |
| "metadata": {}, |
| "outputs": [ |
| { |
| "data": { |
| "text/plain": [ |
| "<pyarrow.lib.ChunkedArray object at 0x12a49a250>\n", |
| "[\n", |
| " [\n", |
| " 1,\n", |
| " 2,\n", |
| " 3\n", |
| " ],\n", |
| " [\n", |
| " 4,\n", |
| " 5,\n", |
| " 6\n", |
| " ]\n", |
| "]" |
| ] |
| }, |
| "execution_count": 54, |
| "metadata": {}, |
| "output_type": "execute_result" |
| } |
| ], |
| "source": [ |
| "pa.chunked_array(chunked)" |
| ] |
| }, |
| { |
| "cell_type": "code", |
| "execution_count": 55, |
| "metadata": {}, |
| "outputs": [ |
| { |
| "data": { |
| "text/plain": [ |
| "<pyarrow.lib.Int32Array object at 0x11b552500>\n", |
| "[\n", |
| " 4,\n", |
| " 5,\n", |
| " 6\n", |
| "]" |
| ] |
| }, |
| "execution_count": 55, |
| "metadata": {}, |
| "output_type": "execute_result" |
| } |
| ], |
| "source": [ |
| "pa.array(chunked.chunk(1))" |
| ] |
| }, |
| { |
| "cell_type": "code", |
| "execution_count": 56, |
| "metadata": {}, |
| "outputs": [ |
| { |
| "data": { |
| "text/plain": [ |
| "nanoarrow.Array<int64>[3]\n", |
| "10\n", |
| "11\n", |
| "12" |
| ] |
| }, |
| "execution_count": 56, |
| "metadata": {}, |
| "output_type": "execute_result" |
| } |
| ], |
| "source": [ |
| "na.Array(pa.array([10, 11, 12]))" |
| ] |
| }, |
| { |
| "cell_type": "code", |
| "execution_count": 57, |
| "metadata": {}, |
| "outputs": [ |
| { |
| "data": { |
| "text/plain": [ |
| "<Schema> string" |
| ] |
| }, |
| "execution_count": 57, |
| "metadata": {}, |
| "output_type": "execute_result" |
| } |
| ], |
| "source": [ |
| "na.Schema(pa.string())" |
| ] |
| }, |
| { |
| "attachments": {}, |
| "cell_type": "markdown", |
| "metadata": {}, |
| "source": [ |
| "## Low-level C library bindings\n", |
| "\n", |
| "The nanoarrow Python package also provides lower level wrappers around Arrow C interface structures. You can create these using `nanoarrow.c_schema()`, `nanoarrow.c_array()`, and `nanoarrow.c_array_stream()`.\n", |
| "\n", |
| "### Schemas\n", |
| "\n", |
| "Use `nanoarrow.c_schema()` to convert an object to an `ArrowSchema` and wrap it as a Python object. This works for any object implementing the [Arrow PyCapsule Interface](https://arrow.apache.org/docs/format/CDataInterface.html) (e.g., `pyarrow.Schema`, `pyarrow.DataType`, and `pyarrow.Field`)." |
| ] |
| }, |
| { |
| "cell_type": "code", |
| "execution_count": 58, |
| "metadata": {}, |
| "outputs": [ |
| { |
| "data": { |
| "text/plain": [ |
| "<nanoarrow.c_schema.CSchema decimal128(10, 3)>\n", |
| "- format: 'd:10,3'\n", |
| "- name: ''\n", |
| "- flags: 2\n", |
| "- metadata: NULL\n", |
| "- dictionary: NULL\n", |
| "- children[0]:" |
| ] |
| }, |
| "execution_count": 58, |
| "metadata": {}, |
| "output_type": "execute_result" |
| } |
| ], |
| "source": [ |
| "na.c_schema(pa.decimal128(10, 3))" |
| ] |
| }, |
| { |
| "cell_type": "markdown", |
| "metadata": {}, |
| "source": [ |
| "Using `c_schema()` is a good fit for testing and for ephemeral schema objects that are being passed from one library to another. To extract the fields of a schema in a more convenient form, use `Schema()`:" |
| ] |
| }, |
| { |
| "cell_type": "code", |
| "execution_count": 59, |
| "metadata": {}, |
| "outputs": [ |
| { |
| "data": { |
| "text/plain": [ |
| "(10, 3)" |
| ] |
| }, |
| "execution_count": 59, |
| "metadata": {}, |
| "output_type": "execute_result" |
| } |
| ], |
| "source": [ |
| "schema = na.Schema(pa.decimal128(10, 3))\n", |
| "schema.precision, schema.scale" |
| ] |
| }, |
| { |
| "attachments": {}, |
| "cell_type": "markdown", |
| "metadata": {}, |
| "source": [ |
| "The `CSchema` object cleans up after itself: when the object is deleted, the underlying `ArrowSchema` is released." |
| ] |
| }, |
| { |
| "attachments": {}, |
| "cell_type": "markdown", |
| "metadata": {}, |
| "source": [ |
| "### Arrays\n", |
| "\n", |
| "You can use `nanoarrow.c_array()` to convert an array-like object to an `ArrowArray`, wrap it as a Python object, and attach a schema that can be used to interpret its contents. This works for any object implementing the [Arrow PyCapsule Interface](https://arrow.apache.org/docs/format/CDataInterface.html) (e.g., `pyarrow.Array`, `pyarrow.RecordBatch`)." |
| ] |
| }, |
| { |
| "cell_type": "code", |
| "execution_count": 60, |
| "metadata": {}, |
| "outputs": [ |
| { |
| "data": { |
| "text/plain": [ |
| "<nanoarrow.c_array.CArray string>\n", |
| "- length: 4\n", |
| "- offset: 0\n", |
| "- null_count: 1\n", |
| "- buffers: (4754305168, 4754307808, 4754310464)\n", |
| "- dictionary: NULL\n", |
| "- children[0]:" |
| ] |
| }, |
| "execution_count": 60, |
| "metadata": {}, |
| "output_type": "execute_result" |
| } |
| ], |
| "source": [ |
| "na.c_array([\"one\", \"two\", \"three\", None], na.string())" |
| ] |
| }, |
| { |
| "attachments": {}, |
| "cell_type": "markdown", |
| "metadata": {}, |
| "source": [ |
| "Using `c_array()` is a good fit for testing and for ephemeral array objects that are being passed from one library to another. For a higher level interface, use `Array()`:" |
| ] |
| }, |
| { |
| "cell_type": "code", |
| "execution_count": 61, |
| "metadata": {}, |
| "outputs": [ |
| { |
| "data": { |
| "text/plain": [ |
| "['one', 'two', 'three', None]" |
| ] |
| }, |
| "execution_count": 61, |
| "metadata": {}, |
| "output_type": "execute_result" |
| } |
| ], |
| "source": [ |
| "array = na.Array([\"one\", \"two\", \"three\", None], na.string())\n", |
| "array.to_pylist()" |
| ] |
| }, |
| { |
| "cell_type": "code", |
| "execution_count": 62, |
| "metadata": {}, |
| "outputs": [ |
| { |
| "data": { |
| "text/plain": [ |
| "(nanoarrow.c_lib.CBufferView(bool[1 b] 11100000),\n", |
| " nanoarrow.c_lib.CBufferView(int32[20 b] 0 3 6 11 11),\n", |
| " nanoarrow.c_lib.CBufferView(string[11 b] b'onetwothree'))" |
| ] |
| }, |
| "execution_count": 62, |
| "metadata": {}, |
| "output_type": "execute_result" |
| } |
| ], |
| "source": [ |
| "array.buffers" |
| ] |
| }, |
| { |
| "cell_type": "markdown", |
| "metadata": {}, |
| "source": [ |
| "Advanced users can create arrays directly from buffers using `c_array_from_buffers()`:" |
| ] |
| }, |
| { |
| "cell_type": "code", |
| "execution_count": 63, |
| "metadata": {}, |
| "outputs": [ |
| { |
| "data": { |
| "text/plain": [ |
| "<nanoarrow.c_array.CArray string>\n", |
| "- length: 2\n", |
| "- offset: 0\n", |
| "- null_count: 0\n", |
| "- buffers: (0, 5002908320, 4999694624)\n", |
| "- dictionary: NULL\n", |
| "- children[0]:" |
| ] |
| }, |
| "execution_count": 63, |
| "metadata": {}, |
| "output_type": "execute_result" |
| } |
| ], |
| "source": [ |
| "na.c_array_from_buffers(\n", |
| " na.string(),\n", |
| " 2,\n", |
| " [None, na.c_buffer([0, 3, 6], na.int32()), b\"abcdef\"]\n", |
| ")" |
| ] |
| }, |
| { |
| "attachments": {}, |
| "cell_type": "markdown", |
| "metadata": {}, |
| "source": [ |
| "### Array streams\n", |
| "\n", |
| "You can use `nanoarrow.c_array_stream()` to wrap an object representing a sequence of `CArray`s with a common `CSchema` to an `ArrowArrayStream` and wrap it as a Python object. This works for any object implementing the [Arrow PyCapsule Interface](https://arrow.apache.org/docs/format/CDataInterface.html) (e.g., `pyarrow.RecordBatchReader`, `pyarrow.ChunkedArray`)." |
| ] |
| }, |
| { |
| "cell_type": "code", |
| "execution_count": 64, |
| "metadata": {}, |
| "outputs": [ |
| { |
| "data": { |
| "text/plain": [ |
| "<nanoarrow.c_array_stream.CArrayStream>\n", |
| "- get_schema(): struct<col1: int64>" |
| ] |
| }, |
| "execution_count": 64, |
| "metadata": {}, |
| "output_type": "execute_result" |
| } |
| ], |
| "source": [ |
| "pa_batch = pa.record_batch({\"col1\": [1, 2, 3]})\n", |
| "reader = pa.RecordBatchReader.from_batches(pa_batch.schema, [pa_batch])\n", |
| "array_stream = na.c_array_stream(reader)\n", |
| "array_stream" |
| ] |
| }, |
| { |
| "attachments": {}, |
| "cell_type": "markdown", |
| "metadata": {}, |
| "source": [ |
| "You can pull the next array from the stream using `.get_next()` or use it like an iterator. The `.get_next()` method will raise `StopIteration` when there are no more arrays in the stream." |
| ] |
| }, |
| { |
| "cell_type": "code", |
| "execution_count": 65, |
| "metadata": {}, |
| "outputs": [ |
| { |
| "name": "stdout", |
| "output_type": "stream", |
| "text": [ |
| "<nanoarrow.c_array.CArray struct<col1: int64>>\n", |
| "- length: 3\n", |
| "- offset: 0\n", |
| "- null_count: 0\n", |
| "- buffers: (0,)\n", |
| "- dictionary: NULL\n", |
| "- children[1]:\n", |
| " 'col1': <nanoarrow.c_array.CArray int64>\n", |
| " - length: 3\n", |
| " - offset: 0\n", |
| " - null_count: 0\n", |
| " - buffers: (0, 2642948588352)\n", |
| " - dictionary: NULL\n", |
| " - children[0]:\n" |
| ] |
| } |
| ], |
| "source": [ |
| "for array in array_stream:\n", |
| " print(array)" |
| ] |
| }, |
| { |
| "cell_type": "markdown", |
| "metadata": {}, |
| "source": [ |
| "Use `ArrayStream()` for a higher level interface:" |
| ] |
| }, |
| { |
| "cell_type": "code", |
| "execution_count": 66, |
| "metadata": {}, |
| "outputs": [ |
| { |
| "data": { |
| "text/plain": [ |
| "nanoarrow.Array<non-nullable struct<col1: int64>>[3]\n", |
| "{'col1': 1}\n", |
| "{'col1': 2}\n", |
| "{'col1': 3}" |
| ] |
| }, |
| "execution_count": 66, |
| "metadata": {}, |
| "output_type": "execute_result" |
| } |
| ], |
| "source": [ |
| "reader = pa.RecordBatchReader.from_batches(pa_batch.schema, [pa_batch])\n", |
| "na.ArrayStream(reader).read_all()" |
| ] |
| }, |
| { |
| "attachments": {}, |
| "cell_type": "markdown", |
| "metadata": {}, |
| "source": [ |
| "## Development\n", |
| "\n", |
| "Python bindings for nanoarrow are managed with [setuptools](https://setuptools.pypa.io/en/latest/index.html).\n", |
| "This means you can build the project using:\n", |
| "\n", |
| "```shell\n", |
| "git clone https://github.com/apache/arrow-nanoarrow.git\n", |
| "cd arrow-nanoarrow/python\n", |
| "pip install -e .\n", |
| "```\n", |
| "\n", |
| "Tests use [pytest](https://docs.pytest.org/):\n", |
| "\n", |
| "```shell\n", |
| "# Install dependencies\n", |
| "pip install -e \".[test]\"\n", |
| "\n", |
| "# Run tests\n", |
| "pytest -vvx\n", |
| "```\n", |
| "\n", |
| "CMake is currently required to ensure that the vendored copy of nanoarrow in the Python package stays in sync with the nanoarrow sources in the working tree." |
| ] |
| } |
| ], |
| "metadata": { |
| "kernelspec": { |
| "display_name": "Python 3", |
| "language": "python", |
| "name": "python3" |
| }, |
| "language_info": { |
| "codemirror_mode": { |
| "name": "ipython", |
| "version": 3 |
| }, |
| "file_extension": ".py", |
| "mimetype": "text/x-python", |
| "name": "python", |
| "nbconvert_exporter": "python", |
| "pygments_lexer": "ipython3", |
| "version": "3.12.3" |
| }, |
| "orig_nbformat": 4 |
| }, |
| "nbformat": 4, |
| "nbformat_minor": 2 |
| } |