python/README.ipynb - arrow-nanoarrow - Git at Google

 {
  "cells": [
   {
    "attachments": {},
    "cell_type": "markdown",
    "metadata": {},
    "source": [
     "<!---\n",
     "  Licensed to the Apache Software Foundation (ASF) under one\n",
     "  or more contributor license agreements.  See the NOTICE file\n",
     "  distributed with this work for additional information\n",
     "  regarding copyright ownership.  The ASF licenses this file\n",
     "  to you under the Apache License, Version 2.0 (the\n",
     "  \"License\"); you may not use this file except in compliance\n",
     "  with the License.  You may obtain a copy of the License at\n",
     "\n",
     "    http://www.apache.org/licenses/LICENSE-2.0\n",
     "\n",
     "  Unless required by applicable law or agreed to in writing,\n",
     "  software distributed under the License is distributed on an\n",
     "  \"AS IS\" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY\n",
     "  KIND, either express or implied.  See the License for the\n",
     "  specific language governing permissions and limitations\n",
     "  under the License.\n",
     "-->\n",
     "\n",
     "<!-- Render with jupyter nbconvert --to markdown README.ipynb -->\n",
     "\n",
     "# nanoarrow for Python\n",
     "\n",
     "The nanoarrow Python package provides bindings to the nanoarrow C library. Like\n",
     "the nanoarrow C library, it provides tools to facilitate the use of the\n",
     "[Arrow C Data](https://arrow.apache.org/docs/format/CDataInterface.html) \n",
     "and [Arrow C Stream](https://arrow.apache.org/docs/format/CStreamInterface.html) \n",
     "interfaces.\n",
     "\n",
     "## Installation\n",
     "\n",
     "The nanoarrow Python bindings are available from [PyPI](https://pypi.org/) and\n",
     "[conda-forge](https://conda-forge.org/):\n",
     "\n",
     "```shell\n",
     "pip install nanoarrow\n",
     "conda install nanoarrow -c conda-forge\n",
     "```\n",
     "\n",
     "Development versions (based on the `main` branch) are also available:\n",
     "\n",
     "```shell\n",
     "pip install --extra-index-url https://pypi.fury.io/arrow-nightlies/ \\\n",
     "    --prefer-binary --pre nanoarrow\n",
     "```\n",
     "\n",
     "If you can import the namespace, you're good to go!"
    ]
   },
   {
    "cell_type": "code",
    "execution_count": 46,
    "metadata": {},
    "outputs": [],
    "source": [
     "import nanoarrow as na"
    ]
   },
   {
    "cell_type": "markdown",
    "metadata": {},
    "source": [
     "## Data types, arrays, and array streams\n",
     "\n",
     "The Arrow C Data and Arrow C Stream interfaces are comprised of three structures: the `ArrowSchema` which represents a data type of an array, the `ArrowArray` which represents the values of an array, and an `ArrowArrayStream`, which represents zero or more `ArrowArray`s with a common `ArrowSchema`. These concepts map to the `nanoarrow.Schema`, `nanoarrow.Array`, and `nanoarrow.ArrayStream` in the Python package."
    ]
   },
   {
    "cell_type": "code",
    "execution_count": 47,
    "metadata": {},
    "outputs": [
     {
      "data": {
       "text/plain": [
        "<Schema> int32"
       ]
      },
      "execution_count": 47,
      "metadata": {},
      "output_type": "execute_result"
     }
    ],
    "source": [
     "na.int32()"
    ]
   },
   {
    "cell_type": "code",
    "execution_count": 48,
    "metadata": {},
    "outputs": [
     {
      "data": {
       "text/plain": [
        "nanoarrow.Array<int32>[3]\n",
        "1\n",
        "2\n",
        "3"
       ]
      },
      "execution_count": 48,
      "metadata": {},
      "output_type": "execute_result"
     }
    ],
    "source": [
     "na.Array([1, 2, 3], na.int32())"
    ]
   },
   {
    "cell_type": "markdown",
    "metadata": {},
    "source": [
     "The `nanoarrow.Array` can accommodate arrays with any number of chunks, reflecting the reality that many array containers (e.g., `pyarrow.ChunkedArray`, `polars.Series`) support this."
    ]
   },
   {
    "cell_type": "code",
    "execution_count": 49,
    "metadata": {},
    "outputs": [
     {
      "data": {
       "text/plain": [
        "nanoarrow.Array<int32>[6]\n",
        "1\n",
        "2\n",
        "3\n",
        "4\n",
        "5\n",
        "6"
       ]
      },
      "execution_count": 49,
      "metadata": {},
      "output_type": "execute_result"
     }
    ],
    "source": [
     "chunked = na.Array.from_chunks([[1, 2, 3], [4, 5, 6]], na.int32())\n",
     "chunked"
    ]
   },
   {
    "cell_type": "markdown",
    "metadata": {},
    "source": [
     "Whereas chunks of an `Array` are always fully materialized when the object is constructed, the chunks of an `ArrayStream` have not necessarily been resolved yet."
    ]
   },
   {
    "cell_type": "code",
    "execution_count": 50,
    "metadata": {},
    "outputs": [
     {
      "data": {
       "text/plain": [
        "nanoarrow.ArrayStream<int32>"
       ]
      },
      "execution_count": 50,
      "metadata": {},
      "output_type": "execute_result"
     }
    ],
    "source": [
     "stream = na.ArrayStream(chunked)\n",
     "stream"
    ]
   },
   {
    "cell_type": "code",
    "execution_count": 51,
    "metadata": {},
    "outputs": [
     {
      "name": "stdout",
      "output_type": "stream",
      "text": [
       "nanoarrow.Array<int32>[3]\n",
       "1\n",
       "2\n",
       "3\n",
       "nanoarrow.Array<int32>[3]\n",
       "4\n",
       "5\n",
       "6\n"
      ]
     }
    ],
    "source": [
     "with stream:\n",
     "    for chunk in stream:\n",
     "        print(chunk)"
    ]
   },
   {
    "cell_type": "markdown",
    "metadata": {},
    "source": [
     "The `nanoarrow.ArrayStream` also provides an interface to nanoarrow's [Arrow IPC](https://arrow.apache.org/docs/format/Columnar.html#serialization-and-interprocess-communication-ipc) reader:"
    ]
   },
   {
    "cell_type": "code",
    "execution_count": 52,
    "metadata": {},
    "outputs": [
     {
      "data": {
       "text/plain": [
        "nanoarrow.ArrayStream<non-nullable struct<commit: string, time: timestamp('us', 'UTC'), files: int3...>"
       ]
      },
      "execution_count": 52,
      "metadata": {},
      "output_type": "execute_result"
     }
    ],
    "source": [
     "url = \"https://github.com/apache/arrow-experiments/raw/main/data/arrow-commits/arrow-commits.arrows\"\n",
     "na.ArrayStream.from_url(url)"
    ]
   },
   {
    "cell_type": "markdown",
    "metadata": {},
    "source": [
     "These objects implement the [Arrow PyCapsule interface](https://arrow.apache.org/docs/format/CDataInterface/PyCapsuleInterface.html) for both producing and consuming and are interchangeable with `pyarrow` objects in many cases:"
    ]
   },
   {
    "cell_type": "code",
    "execution_count": 53,
    "metadata": {},
    "outputs": [
     {
      "data": {
       "text/plain": [
        "pyarrow.Field<: int32>"
       ]
      },
      "execution_count": 53,
      "metadata": {},
      "output_type": "execute_result"
     }
    ],
    "source": [
     "import pyarrow as pa\n",
     "\n",
     "pa.field(na.int32())"
    ]
   },
   {
    "cell_type": "code",
    "execution_count": 54,
    "metadata": {},
    "outputs": [
     {
      "data": {
       "text/plain": [
        "<pyarrow.lib.ChunkedArray object at 0x12a49a250>\n",
        "[\n",
        "  [\n",
        "    1,\n",
        "    2,\n",
        "    3\n",
        "  ],\n",
        "  [\n",
        "    4,\n",
        "    5,\n",
        "    6\n",
        "  ]\n",
        "]"
       ]
      },
      "execution_count": 54,
      "metadata": {},
      "output_type": "execute_result"
     }
    ],
    "source": [
     "pa.chunked_array(chunked)"
    ]
   },
   {
    "cell_type": "code",
    "execution_count": 55,
    "metadata": {},
    "outputs": [
     {
      "data": {
       "text/plain": [
        "<pyarrow.lib.Int32Array object at 0x11b552500>\n",
        "[\n",
        "  4,\n",
        "  5,\n",
        "  6\n",
        "]"
       ]
      },
      "execution_count": 55,
      "metadata": {},
      "output_type": "execute_result"
     }
    ],
    "source": [
     "pa.array(chunked.chunk(1))"
    ]
   },
   {
    "cell_type": "code",
    "execution_count": 56,
    "metadata": {},
    "outputs": [
     {
      "data": {
       "text/plain": [
        "nanoarrow.Array<int64>[3]\n",
        "10\n",
        "11\n",
        "12"
       ]
      },
      "execution_count": 56,
      "metadata": {},
      "output_type": "execute_result"
     }
    ],
    "source": [
     "na.Array(pa.array([10, 11, 12]))"
    ]
   },
   {
    "cell_type": "code",
    "execution_count": 57,
    "metadata": {},
    "outputs": [
     {
      "data": {
       "text/plain": [
        "<Schema> string"
       ]
      },
      "execution_count": 57,
      "metadata": {},
      "output_type": "execute_result"
     }
    ],
    "source": [
     "na.Schema(pa.string())"
    ]
   },
   {
    "attachments": {},
    "cell_type": "markdown",
    "metadata": {},
    "source": [
     "## Low-level C library bindings\n",
     "\n",
     "The nanoarrow Python package also provides lower level wrappers around Arrow C interface structures. You can create these using `nanoarrow.c_schema()`, `nanoarrow.c_array()`, and `nanoarrow.c_array_stream()`.\n",
     "\n",
     "### Schemas\n",
     "\n",
     "Use `nanoarrow.c_schema()` to convert an object to an `ArrowSchema` and wrap it as a Python object. This works for any object implementing the [Arrow PyCapsule Interface](https://arrow.apache.org/docs/format/CDataInterface.html) (e.g., `pyarrow.Schema`, `pyarrow.DataType`, and `pyarrow.Field`)."
    ]
   },
   {
    "cell_type": "code",
    "execution_count": 58,
    "metadata": {},
    "outputs": [
     {
      "data": {
       "text/plain": [
        "<nanoarrow.c_schema.CSchema decimal128(10, 3)>\n",
        "- format: 'd:10,3'\n",
        "- name: ''\n",
        "- flags: 2\n",
        "- metadata: NULL\n",
        "- dictionary: NULL\n",
        "- children[0]:"
       ]
      },
      "execution_count": 58,
      "metadata": {},
      "output_type": "execute_result"
     }
    ],
    "source": [
     "na.c_schema(pa.decimal128(10, 3))"
    ]
   },
   {
    "cell_type": "markdown",
    "metadata": {},
    "source": [
     "Using `c_schema()` is a good fit for testing and for ephemeral schema objects that are being passed from one library to another. To extract the fields of a schema in a more convenient form, use `Schema()`:"
    ]
   },
   {
    "cell_type": "code",
    "execution_count": 59,
    "metadata": {},
    "outputs": [
     {
      "data": {
       "text/plain": [
        "(10, 3)"
       ]
      },
      "execution_count": 59,
      "metadata": {},
      "output_type": "execute_result"
     }
    ],
    "source": [
     "schema = na.Schema(pa.decimal128(10, 3))\n",
     "schema.precision, schema.scale"
    ]
   },
   {
    "attachments": {},
    "cell_type": "markdown",
    "metadata": {},
    "source": [
     "The `CSchema` object cleans up after itself: when the object is deleted, the underlying `ArrowSchema` is released."
    ]
   },
   {
    "attachments": {},
    "cell_type": "markdown",
    "metadata": {},
    "source": [
     "### Arrays\n",
     "\n",
     "You can use `nanoarrow.c_array()` to convert an array-like object to an `ArrowArray`, wrap it as a Python object, and attach a schema that can be used to interpret its contents. This works for any object implementing the [Arrow PyCapsule Interface](https://arrow.apache.org/docs/format/CDataInterface.html) (e.g., `pyarrow.Array`, `pyarrow.RecordBatch`)."
    ]
   },
   {
    "cell_type": "code",
    "execution_count": 60,
    "metadata": {},
    "outputs": [
     {
      "data": {
       "text/plain": [
        "<nanoarrow.c_array.CArray string>\n",
        "- length: 4\n",
        "- offset: 0\n",
        "- null_count: 1\n",
        "- buffers: (4754305168, 4754307808, 4754310464)\n",
        "- dictionary: NULL\n",
        "- children[0]:"
       ]
      },
      "execution_count": 60,
      "metadata": {},
      "output_type": "execute_result"
     }
    ],
    "source": [
     "na.c_array([\"one\", \"two\", \"three\", None], na.string())"
    ]
   },
   {
    "attachments": {},
    "cell_type": "markdown",
    "metadata": {},
    "source": [
     "Using `c_array()` is a good fit for testing and for ephemeral array objects that are being passed from one library to another. For a higher level interface, use `Array()`:"
    ]
   },
   {
    "cell_type": "code",
    "execution_count": 61,
    "metadata": {},
    "outputs": [
     {
      "data": {
       "text/plain": [
        "['one', 'two', 'three', None]"
       ]
      },
      "execution_count": 61,
      "metadata": {},
      "output_type": "execute_result"
     }
    ],
    "source": [
     "array = na.Array([\"one\", \"two\", \"three\", None], na.string())\n",
     "array.to_pylist()"
    ]
   },
   {
    "cell_type": "code",
    "execution_count": 62,
    "metadata": {},
    "outputs": [
     {
      "data": {
       "text/plain": [
        "(nanoarrow.c_lib.CBufferView(bool[1 b] 11100000),\n",
        " nanoarrow.c_lib.CBufferView(int32[20 b] 0 3 6 11 11),\n",
        " nanoarrow.c_lib.CBufferView(string[11 b] b'onetwothree'))"
       ]
      },
      "execution_count": 62,
      "metadata": {},
      "output_type": "execute_result"
     }
    ],
    "source": [
     "array.buffers"
    ]
   },
   {
    "cell_type": "markdown",
    "metadata": {},
    "source": [
     "Advanced users can create arrays directly from buffers using `c_array_from_buffers()`:"
    ]
   },
   {
    "cell_type": "code",
    "execution_count": 63,
    "metadata": {},
    "outputs": [
     {
      "data": {
       "text/plain": [
        "<nanoarrow.c_array.CArray string>\n",
        "- length: 2\n",
        "- offset: 0\n",
        "- null_count: 0\n",
        "- buffers: (0, 5002908320, 4999694624)\n",
        "- dictionary: NULL\n",
        "- children[0]:"
       ]
      },
      "execution_count": 63,
      "metadata": {},
      "output_type": "execute_result"
     }
    ],
    "source": [
     "na.c_array_from_buffers(\n",
     "    na.string(),\n",
     "    2,\n",
     "    [None, na.c_buffer([0, 3, 6], na.int32()), b\"abcdef\"]\n",
     ")"
    ]
   },
   {
    "attachments": {},
    "cell_type": "markdown",
    "metadata": {},
    "source": [
     "### Array streams\n",
     "\n",
     "You can use `nanoarrow.c_array_stream()` to wrap an object representing a sequence of `CArray`s with a common `CSchema` to an `ArrowArrayStream` and wrap it as a Python object. This works for any object implementing the [Arrow PyCapsule Interface](https://arrow.apache.org/docs/format/CDataInterface.html) (e.g., `pyarrow.RecordBatchReader`, `pyarrow.ChunkedArray`)."
    ]
   },
   {
    "cell_type": "code",
    "execution_count": 64,
    "metadata": {},
    "outputs": [
     {
      "data": {
       "text/plain": [
        "<nanoarrow.c_array_stream.CArrayStream>\n",
        "- get_schema(): struct<col1: int64>"
       ]
      },
      "execution_count": 64,
      "metadata": {},
      "output_type": "execute_result"
     }
    ],
    "source": [
     "pa_batch = pa.record_batch({\"col1\": [1, 2, 3]})\n",
     "reader = pa.RecordBatchReader.from_batches(pa_batch.schema, [pa_batch])\n",
     "array_stream = na.c_array_stream(reader)\n",
     "array_stream"
    ]
   },
   {
    "attachments": {},
    "cell_type": "markdown",
    "metadata": {},
    "source": [
     "You can pull the next array from the stream using `.get_next()` or use it like an iterator. The `.get_next()` method will raise `StopIteration` when there are no more arrays in the stream."
    ]
   },
   {
    "cell_type": "code",
    "execution_count": 65,
    "metadata": {},
    "outputs": [
     {
      "name": "stdout",
      "output_type": "stream",
      "text": [
       "<nanoarrow.c_array.CArray struct<col1: int64>>\n",
       "- length: 3\n",
       "- offset: 0\n",
       "- null_count: 0\n",
       "- buffers: (0,)\n",
       "- dictionary: NULL\n",
       "- children[1]:\n",
       "  'col1': <nanoarrow.c_array.CArray int64>\n",
       "    - length: 3\n",
       "    - offset: 0\n",
       "    - null_count: 0\n",
       "    - buffers: (0, 2642948588352)\n",
       "    - dictionary: NULL\n",
       "    - children[0]:\n"
      ]
     }
    ],
    "source": [
     "for array in array_stream:\n",
     "    print(array)"
    ]
   },
   {
    "cell_type": "markdown",
    "metadata": {},
    "source": [
     "Use `ArrayStream()` for a higher level interface:"
    ]
   },
   {
    "cell_type": "code",
    "execution_count": 66,
    "metadata": {},
    "outputs": [
     {
      "data": {
       "text/plain": [
        "nanoarrow.Array<non-nullable struct<col1: int64>>[3]\n",
        "{'col1': 1}\n",
        "{'col1': 2}\n",
        "{'col1': 3}"
       ]
      },
      "execution_count": 66,
      "metadata": {},
      "output_type": "execute_result"
     }
    ],
    "source": [
     "reader = pa.RecordBatchReader.from_batches(pa_batch.schema, [pa_batch])\n",
     "na.ArrayStream(reader).read_all()"
    ]
   },
   {
    "attachments": {},
    "cell_type": "markdown",
    "metadata": {},
    "source": [
     "## Development\n",
     "\n",
     "Python bindings for nanoarrow are managed with [setuptools](https://setuptools.pypa.io/en/latest/index.html).\n",
     "This means you can build the project using:\n",
     "\n",
     "```shell\n",
     "git clone https://github.com/apache/arrow-nanoarrow.git\n",
     "cd arrow-nanoarrow/python\n",
     "pip install -e .\n",
     "```\n",
     "\n",
     "Tests use [pytest](https://docs.pytest.org/):\n",
     "\n",
     "```shell\n",
     "# Install dependencies\n",
     "pip install -e \".[test]\"\n",
     "\n",
     "# Run tests\n",
     "pytest -vvx\n",
     "```\n",
     "\n",
     "CMake is currently required to ensure that the vendored copy of nanoarrow in the Python package stays in sync with the nanoarrow sources in the working tree."
    ]
   }
  ],
  "metadata": {
   "kernelspec": {
    "display_name": "Python 3",
    "language": "python",
    "name": "python3"
   },
   "language_info": {
    "codemirror_mode": {
     "name": "ipython",
     "version": 3
    },
    "file_extension": ".py",
    "mimetype": "text/x-python",
    "name": "python",
    "nbconvert_exporter": "python",
    "pygments_lexer": "ipython3",
    "version": "3.12.3"
   },
   "orig_nbformat": 4
  },
  "nbformat": 4,
  "nbformat_minor": 2
 }
	{
	"cells": [
	{
	"attachments": {},
	"cell_type": "markdown",
	"metadata": {},
	"source": [
	"<!---\n",
	" Licensed to the Apache Software Foundation (ASF) under one\n",
	" or more contributor license agreements. See the NOTICE file\n",
	" distributed with this work for additional information\n",
	" regarding copyright ownership. The ASF licenses this file\n",
	" to you under the Apache License, Version 2.0 (the\n",
	" \"License\"); you may not use this file except in compliance\n",
	" with the License. You may obtain a copy of the License at\n",
	"\n",
	" http://www.apache.org/licenses/LICENSE-2.0\n",
	"\n",
	" Unless required by applicable law or agreed to in writing,\n",
	" software distributed under the License is distributed on an\n",
	" \"AS IS\" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY\n",
	" KIND, either express or implied. See the License for the\n",
	" specific language governing permissions and limitations\n",
	" under the License.\n",
	"-->\n",
	"\n",
	"<!-- Render with jupyter nbconvert --to markdown README.ipynb -->\n",
	"\n",
	"# nanoarrow for Python\n",
	"\n",
	"The nanoarrow Python package provides bindings to the nanoarrow C library. Like\n",
	"the nanoarrow C library, it provides tools to facilitate the use of the\n",
	"[Arrow C Data](https://arrow.apache.org/docs/format/CDataInterface.html) \n",
	"and [Arrow C Stream](https://arrow.apache.org/docs/format/CStreamInterface.html) \n",
	"interfaces.\n",
	"\n",
	"## Installation\n",
	"\n",
	"The nanoarrow Python bindings are available from [PyPI](https://pypi.org/) and\n",
	"[conda-forge](https://conda-forge.org/):\n",
	"\n",
	"```shell\n",
	"pip install nanoarrow\n",
	"conda install nanoarrow -c conda-forge\n",
	"```\n",
	"\n",
	"Development versions (based on the `main` branch) are also available:\n",
	"\n",
	"```shell\n",
	"pip install --extra-index-url https://pypi.fury.io/arrow-nightlies/ \\\n",
	" --prefer-binary --pre nanoarrow\n",
	"```\n",
	"\n",
	"If you can import the namespace, you're good to go!"
	]
	},
	{
	"cell_type": "code",
	"execution_count": 46,
	"metadata": {},
	"outputs": [],
	"source": [
	"import nanoarrow as na"
	]
	},
	{
	"cell_type": "markdown",
	"metadata": {},
	"source": [
	"## Data types, arrays, and array streams\n",
	"\n",
	"The Arrow C Data and Arrow C Stream interfaces are comprised of three structures: the `ArrowSchema` which represents a data type of an array, the `ArrowArray` which represents the values of an array, and an `ArrowArrayStream`, which represents zero or more `ArrowArray`s with a common `ArrowSchema`. These concepts map to the `nanoarrow.Schema`, `nanoarrow.Array`, and `nanoarrow.ArrayStream` in the Python package."
	]
	},
	{
	"cell_type": "code",
	"execution_count": 47,
	"metadata": {},
	"outputs": [
	{
	"data": {
	"text/plain": [
	"<Schema> int32"
	]
	},
	"execution_count": 47,
	"metadata": {},
	"output_type": "execute_result"
	}
	],
	"source": [
	"na.int32()"
	]
	},
	{
	"cell_type": "code",
	"execution_count": 48,
	"metadata": {},
	"outputs": [
	{
	"data": {
	"text/plain": [
	"nanoarrow.Array<int32>[3]\n",
	"1\n",
	"2\n",
	"3"
	]
	},
	"execution_count": 48,
	"metadata": {},
	"output_type": "execute_result"
	}
	],
	"source": [
	"na.Array([1, 2, 3], na.int32())"
	]
	},
	{
	"cell_type": "markdown",
	"metadata": {},
	"source": [
	"The `nanoarrow.Array` can accommodate arrays with any number of chunks, reflecting the reality that many array containers (e.g., `pyarrow.ChunkedArray`, `polars.Series`) support this."
	]
	},
	{
	"cell_type": "code",
	"execution_count": 49,
	"metadata": {},
	"outputs": [
	{
	"data": {
	"text/plain": [
	"nanoarrow.Array<int32>[6]\n",
	"1\n",
	"2\n",
	"3\n",
	"4\n",
	"5\n",
	"6"
	]
	},
	"execution_count": 49,
	"metadata": {},
	"output_type": "execute_result"
	}
	],
	"source": [
	"chunked = na.Array.from_chunks([[1, 2, 3], [4, 5, 6]], na.int32())\n",
	"chunked"
	]
	},
	{
	"cell_type": "markdown",
	"metadata": {},
	"source": [
	"Whereas chunks of an `Array` are always fully materialized when the object is constructed, the chunks of an `ArrayStream` have not necessarily been resolved yet."
	]
	},
	{
	"cell_type": "code",
	"execution_count": 50,
	"metadata": {},
	"outputs": [
	{
	"data": {
	"text/plain": [
	"nanoarrow.ArrayStream<int32>"
	]
	},
	"execution_count": 50,
	"metadata": {},
	"output_type": "execute_result"
	}
	],
	"source": [
	"stream = na.ArrayStream(chunked)\n",
	"stream"
	]
	},
	{
	"cell_type": "code",
	"execution_count": 51,
	"metadata": {},
	"outputs": [
	{
	"name": "stdout",
	"output_type": "stream",
	"text": [
	"nanoarrow.Array<int32>[3]\n",
	"1\n",
	"2\n",
	"3\n",
	"nanoarrow.Array<int32>[3]\n",
	"4\n",
	"5\n",
	"6\n"
	]
	}
	],
	"source": [
	"with stream:\n",
	" for chunk in stream:\n",
	" print(chunk)"
	]
	},
	{
	"cell_type": "markdown",
	"metadata": {},
	"source": [
	"The `nanoarrow.ArrayStream` also provides an interface to nanoarrow's [Arrow IPC](https://arrow.apache.org/docs/format/Columnar.html#serialization-and-interprocess-communication-ipc) reader:"
	]
	},
	{
	"cell_type": "code",
	"execution_count": 52,
	"metadata": {},
	"outputs": [
	{
	"data": {
	"text/plain": [
	"nanoarrow.ArrayStream<non-nullable struct<commit: string, time: timestamp('us', 'UTC'), files: int3...>"
	]
	},
	"execution_count": 52,
	"metadata": {},
	"output_type": "execute_result"
	}
	],
	"source": [
	"url = \"https://github.com/apache/arrow-experiments/raw/main/data/arrow-commits/arrow-commits.arrows\"\n",
	"na.ArrayStream.from_url(url)"
	]
	},
	{
	"cell_type": "markdown",
	"metadata": {},
	"source": [
	"These objects implement the [Arrow PyCapsule interface](https://arrow.apache.org/docs/format/CDataInterface/PyCapsuleInterface.html) for both producing and consuming and are interchangeable with `pyarrow` objects in many cases:"
	]
	},
	{
	"cell_type": "code",
	"execution_count": 53,
	"metadata": {},
	"outputs": [
	{
	"data": {
	"text/plain": [
	"pyarrow.Field<: int32>"
	]
	},
	"execution_count": 53,
	"metadata": {},
	"output_type": "execute_result"
	}
	],
	"source": [
	"import pyarrow as pa\n",
	"\n",
	"pa.field(na.int32())"
	]
	},
	{
	"cell_type": "code",
	"execution_count": 54,
	"metadata": {},
	"outputs": [
	{
	"data": {
	"text/plain": [
	"<pyarrow.lib.ChunkedArray object at 0x12a49a250>\n",
	"[\n",
	" [\n",
	" 1,\n",
	" 2,\n",
	" 3\n",
	" ],\n",
	" [\n",
	" 4,\n",
	" 5,\n",
	" 6\n",
	" ]\n",
	"]"
	]
	},
	"execution_count": 54,
	"metadata": {},
	"output_type": "execute_result"
	}
	],
	"source": [
	"pa.chunked_array(chunked)"
	]
	},
	{
	"cell_type": "code",
	"execution_count": 55,
	"metadata": {},
	"outputs": [
	{
	"data": {
	"text/plain": [
	"<pyarrow.lib.Int32Array object at 0x11b552500>\n",
	"[\n",
	" 4,\n",
	" 5,\n",
	" 6\n",
	"]"
	]
	},
	"execution_count": 55,
	"metadata": {},
	"output_type": "execute_result"
	}
	],
	"source": [
	"pa.array(chunked.chunk(1))"
	]
	},
	{
	"cell_type": "code",
	"execution_count": 56,
	"metadata": {},
	"outputs": [
	{
	"data": {
	"text/plain": [
	"nanoarrow.Array<int64>[3]\n",
	"10\n",
	"11\n",
	"12"
	]
	},
	"execution_count": 56,
	"metadata": {},
	"output_type": "execute_result"
	}
	],
	"source": [
	"na.Array(pa.array([10, 11, 12]))"
	]
	},
	{
	"cell_type": "code",
	"execution_count": 57,
	"metadata": {},
	"outputs": [
	{
	"data": {
	"text/plain": [
	"<Schema> string"
	]
	},
	"execution_count": 57,
	"metadata": {},
	"output_type": "execute_result"
	}
	],
	"source": [
	"na.Schema(pa.string())"
	]
	},
	{
	"attachments": {},
	"cell_type": "markdown",
	"metadata": {},
	"source": [
	"## Low-level C library bindings\n",
	"\n",
	"The nanoarrow Python package also provides lower level wrappers around Arrow C interface structures. You can create these using `nanoarrow.c_schema()`, `nanoarrow.c_array()`, and `nanoarrow.c_array_stream()`.\n",
	"\n",
	"### Schemas\n",
	"\n",
	"Use `nanoarrow.c_schema()` to convert an object to an `ArrowSchema` and wrap it as a Python object. This works for any object implementing the [Arrow PyCapsule Interface](https://arrow.apache.org/docs/format/CDataInterface.html) (e.g., `pyarrow.Schema`, `pyarrow.DataType`, and `pyarrow.Field`)."
	]
	},
	{
	"cell_type": "code",
	"execution_count": 58,
	"metadata": {},
	"outputs": [
	{
	"data": {
	"text/plain": [
	"<nanoarrow.c_schema.CSchema decimal128(10, 3)>\n",
	"- format: 'd:10,3'\n",
	"- name: ''\n",
	"- flags: 2\n",
	"- metadata: NULL\n",
	"- dictionary: NULL\n",
	"- children[0]:"
	]
	},
	"execution_count": 58,
	"metadata": {},
	"output_type": "execute_result"
	}
	],
	"source": [
	"na.c_schema(pa.decimal128(10, 3))"
	]
	},
	{
	"cell_type": "markdown",
	"metadata": {},
	"source": [
	"Using `c_schema()` is a good fit for testing and for ephemeral schema objects that are being passed from one library to another. To extract the fields of a schema in a more convenient form, use `Schema()`:"
	]
	},
	{
	"cell_type": "code",
	"execution_count": 59,
	"metadata": {},
	"outputs": [
	{
	"data": {
	"text/plain": [
	"(10, 3)"
	]
	},
	"execution_count": 59,
	"metadata": {},
	"output_type": "execute_result"
	}
	],
	"source": [
	"schema = na.Schema(pa.decimal128(10, 3))\n",
	"schema.precision, schema.scale"
	]
	},
	{
	"attachments": {},
	"cell_type": "markdown",
	"metadata": {},
	"source": [
	"The `CSchema` object cleans up after itself: when the object is deleted, the underlying `ArrowSchema` is released."
	]
	},
	{
	"attachments": {},
	"cell_type": "markdown",
	"metadata": {},
	"source": [
	"### Arrays\n",
	"\n",
	"You can use `nanoarrow.c_array()` to convert an array-like object to an `ArrowArray`, wrap it as a Python object, and attach a schema that can be used to interpret its contents. This works for any object implementing the [Arrow PyCapsule Interface](https://arrow.apache.org/docs/format/CDataInterface.html) (e.g., `pyarrow.Array`, `pyarrow.RecordBatch`)."
	]
	},
	{
	"cell_type": "code",
	"execution_count": 60,
	"metadata": {},
	"outputs": [
	{
	"data": {
	"text/plain": [
	"<nanoarrow.c_array.CArray string>\n",
	"- length: 4\n",
	"- offset: 0\n",
	"- null_count: 1\n",
	"- buffers: (4754305168, 4754307808, 4754310464)\n",
	"- dictionary: NULL\n",
	"- children[0]:"
	]
	},
	"execution_count": 60,
	"metadata": {},
	"output_type": "execute_result"
	}
	],
	"source": [
	"na.c_array([\"one\", \"two\", \"three\", None], na.string())"
	]
	},
	{
	"attachments": {},
	"cell_type": "markdown",
	"metadata": {},
	"source": [
	"Using `c_array()` is a good fit for testing and for ephemeral array objects that are being passed from one library to another. For a higher level interface, use `Array()`:"
	]
	},
	{
	"cell_type": "code",
	"execution_count": 61,
	"metadata": {},
	"outputs": [
	{
	"data": {
	"text/plain": [
	"['one', 'two', 'three', None]"
	]
	},
	"execution_count": 61,
	"metadata": {},
	"output_type": "execute_result"
	}
	],
	"source": [
	"array = na.Array([\"one\", \"two\", \"three\", None], na.string())\n",
	"array.to_pylist()"
	]
	},
	{
	"cell_type": "code",
	"execution_count": 62,
	"metadata": {},
	"outputs": [
	{
	"data": {
	"text/plain": [
	"(nanoarrow.c_lib.CBufferView(bool[1 b] 11100000),\n",
	" nanoarrow.c_lib.CBufferView(int32[20 b] 0 3 6 11 11),\n",
	" nanoarrow.c_lib.CBufferView(string[11 b] b'onetwothree'))"
	]
	},
	"execution_count": 62,
	"metadata": {},
	"output_type": "execute_result"
	}
	],
	"source": [
	"array.buffers"
	]
	},
	{
	"cell_type": "markdown",
	"metadata": {},
	"source": [
	"Advanced users can create arrays directly from buffers using `c_array_from_buffers()`:"
	]
	},
	{
	"cell_type": "code",
	"execution_count": 63,
	"metadata": {},
	"outputs": [
	{
	"data": {
	"text/plain": [
	"<nanoarrow.c_array.CArray string>\n",
	"- length: 2\n",
	"- offset: 0\n",
	"- null_count: 0\n",
	"- buffers: (0, 5002908320, 4999694624)\n",
	"- dictionary: NULL\n",
	"- children[0]:"
	]
	},
	"execution_count": 63,
	"metadata": {},
	"output_type": "execute_result"
	}
	],
	"source": [
	"na.c_array_from_buffers(\n",
	" na.string(),\n",
	" 2,\n",
	" [None, na.c_buffer([0, 3, 6], na.int32()), b\"abcdef\"]\n",
	")"
	]
	},
	{
	"attachments": {},
	"cell_type": "markdown",
	"metadata": {},
	"source": [
	"### Array streams\n",
	"\n",
	"You can use `nanoarrow.c_array_stream()` to wrap an object representing a sequence of `CArray`s with a common `CSchema` to an `ArrowArrayStream` and wrap it as a Python object. This works for any object implementing the [Arrow PyCapsule Interface](https://arrow.apache.org/docs/format/CDataInterface.html) (e.g., `pyarrow.RecordBatchReader`, `pyarrow.ChunkedArray`)."
	]
	},
	{
	"cell_type": "code",
	"execution_count": 64,
	"metadata": {},
	"outputs": [
	{
	"data": {
	"text/plain": [
	"<nanoarrow.c_array_stream.CArrayStream>\n",
	"- get_schema(): struct<col1: int64>"
	]
	},
	"execution_count": 64,
	"metadata": {},
	"output_type": "execute_result"
	}
	],
	"source": [
	"pa_batch = pa.record_batch({\"col1\": [1, 2, 3]})\n",
	"reader = pa.RecordBatchReader.from_batches(pa_batch.schema, [pa_batch])\n",
	"array_stream = na.c_array_stream(reader)\n",
	"array_stream"
	]
	},
	{
	"attachments": {},
	"cell_type": "markdown",
	"metadata": {},
	"source": [
	"You can pull the next array from the stream using `.get_next()` or use it like an iterator. The `.get_next()` method will raise `StopIteration` when there are no more arrays in the stream."
	]
	},
	{
	"cell_type": "code",
	"execution_count": 65,
	"metadata": {},
	"outputs": [
	{
	"name": "stdout",
	"output_type": "stream",
	"text": [
	"<nanoarrow.c_array.CArray struct<col1: int64>>\n",
	"- length: 3\n",
	"- offset: 0\n",
	"- null_count: 0\n",
	"- buffers: (0,)\n",
	"- dictionary: NULL\n",
	"- children[1]:\n",
	" 'col1': <nanoarrow.c_array.CArray int64>\n",
	" - length: 3\n",
	" - offset: 0\n",
	" - null_count: 0\n",
	" - buffers: (0, 2642948588352)\n",
	" - dictionary: NULL\n",
	" - children[0]:\n"
	]
	}
	],
	"source": [
	"for array in array_stream:\n",
	" print(array)"
	]
	},
	{
	"cell_type": "markdown",
	"metadata": {},
	"source": [
	"Use `ArrayStream()` for a higher level interface:"
	]
	},
	{
	"cell_type": "code",
	"execution_count": 66,
	"metadata": {},
	"outputs": [
	{
	"data": {
	"text/plain": [
	"nanoarrow.Array<non-nullable struct<col1: int64>>[3]\n",
	"{'col1': 1}\n",
	"{'col1': 2}\n",
	"{'col1': 3}"
	]
	},
	"execution_count": 66,
	"metadata": {},
	"output_type": "execute_result"
	}
	],
	"source": [
	"reader = pa.RecordBatchReader.from_batches(pa_batch.schema, [pa_batch])\n",
	"na.ArrayStream(reader).read_all()"
	]
	},
	{
	"attachments": {},
	"cell_type": "markdown",
	"metadata": {},
	"source": [
	"## Development\n",
	"\n",
	"Python bindings for nanoarrow are managed with [setuptools](https://setuptools.pypa.io/en/latest/index.html).\n",
	"This means you can build the project using:\n",
	"\n",
	"```shell\n",
	"git clone https://github.com/apache/arrow-nanoarrow.git\n",
	"cd arrow-nanoarrow/python\n",
	"pip install -e .\n",
	"```\n",
	"\n",
	"Tests use [pytest](https://docs.pytest.org/):\n",
	"\n",
	"```shell\n",
	"# Install dependencies\n",
	"pip install -e \".[test]\"\n",
	"\n",
	"# Run tests\n",
	"pytest -vvx\n",
	"```\n",
	"\n",
	"CMake is currently required to ensure that the vendored copy of nanoarrow in the Python package stays in sync with the nanoarrow sources in the working tree."
	]
	}
	],
	"metadata": {
	"kernelspec": {
	"display_name": "Python 3",
	"language": "python",
	"name": "python3"
	},
	"language_info": {
	"codemirror_mode": {
	"name": "ipython",
	"version": 3
	},
	"file_extension": ".py",
	"mimetype": "text/x-python",
	"name": "python",
	"nbconvert_exporter": "python",
	"pygments_lexer": "ipython3",
	"version": "3.12.3"
	},
	"orig_nbformat": 4
	},
	"nbformat": 4,
	"nbformat_minor": 2
	}