commit | 7b8a7c85ad7456c5e6181bdae1920ddace8f9044 | |
---|---|---|
author | Dewey Dunnington <dewey@dunnington.ca> | Fri Dec 06 21:51:35 2024 +0000 |
committer | GitHub <noreply@github.com> | Fri Dec 06 15:51:35 2024 -0600 |
tree | 7914feaa4db67270d0256ef121deb331a4e921db | |
parent | e54b7df525fa1d310a96687bd99902823402b26c | |
feat(python): Implement extension type/mechanism in python package (#688)

This PR implements the first canonical extension type for nanoarrow/Python: `nanoarrow.bool8()`. In doing so it also implements some machinery for "extensions", which is intended to be internal and to evolve with the requirements of some follow-up canonical extension types.

The extension points are:

- Parsing the metadata when `Schema.extension` is accessed to get parameter access (and validate the metadata)
- Converting an extension array to Python objects (i.e., `array.to_pylist()`)
- Converting an extension array to a Python sequence (i.e., `array.to_pysequence()`)
- Constructing an extension array from Python objects (i.e., `na.Array([True, False, None], na.bool8())`)
- Constructing an extension array from a Python buffer (i.e., `na.Array(np.array([True, False, False]), na.bool8())`)

I am not sure the extension point implementations via the `Extension` methods are the final way this should be done, but this PR at least connects the wires. The tensor extensions are more complex and will require some modification of this, but I'd like to do that in another PR.

```python
import nanoarrow as na
import numpy as np

na.Array([True, False, None], na.bool8())
#> nanoarrow.Array<arrow.bool8{int8}>[3]
#> True
#> False
#> None

na.Array(np.array([True, False, True]), na.bool8())
#> nanoarrow.Array<arrow.bool8{int8}>[3]
#> True
#> False
#> True

na.Array([True, False, None], na.bool8()).to_pylist()
#> [True, False, None]

np.array(na.Array([True, False, True], na.bool8()).to_pysequence())
#> array([ True, False,  True])
```
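To make the extension points listed in the commit message more concrete, here is a minimal standard-library sketch of how a registry of extension handlers might look. It is not nanoarrow's implementation: `Bool8Sketch`, `EXTENSIONS`, `resolve()`, and the method names are hypothetical, and the validity list is a simplified stand-in for an Arrow validity bitmap. The sketch assumes only what the example output above shows (int8 storage for `arrow.bool8`) plus the convention that zero is false and any non-zero value is true.

```python
from array import array
from typing import Dict, Iterable, List, Optional, Tuple


class Bool8Sketch:
    """Hypothetical handler for an int8-backed boolean extension type."""

    name = "arrow.bool8"

    def parse_metadata(self, metadata: bytes) -> dict:
        # Assumption of this sketch: the type has no parameters, so any
        # non-empty metadata is treated as invalid.
        if metadata:
            raise ValueError(f"{self.name} expects empty metadata")
        return {}

    def from_pylist(
        self, values: Iterable[Optional[bool]]
    ) -> Tuple[array, List[bool]]:
        # Build int8 storage plus a per-element validity flag (None -> null).
        storage = array("b")
        validity: List[bool] = []
        for value in values:
            validity.append(value is not None)
            storage.append(1 if value else 0)
        return storage, validity

    def to_pylist(
        self, storage: array, validity: List[bool]
    ) -> List[Optional[bool]]:
        # Any non-zero storage value reads back as True; nulls become None.
        return [bool(x) if ok else None for x, ok in zip(storage, validity)]


# Hypothetical registry keyed by extension name.
EXTENSIONS: Dict[str, Bool8Sketch] = {Bool8Sketch.name: Bool8Sketch()}


def resolve(extension_name: str) -> Bool8Sketch:
    """Look up the handler for an extension name, or raise KeyError."""
    return EXTENSIONS[extension_name]


if __name__ == "__main__":
    ext = resolve("arrow.bool8")
    storage, validity = ext.from_pylist([True, False, None])
    print(list(storage), validity)
    print(ext.to_pylist(storage, validity))
```

Keeping the conversion logic on a per-type object is one way to let later extension types plug in without touching the generic array code, which is the general direction the commit message describes.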
The nanoarrow libraries are a set of helpers to produce and consume Arrow data, including the Arrow C Data, Arrow C Stream, and Arrow C Device structures and the serialized Arrow IPC format. The vision of nanoarrow is that it should be trivial for libraries to produce and consume Arrow data: it helps fulfill this vision by providing high-quality, easy-to-adopt helpers to produce, consume, and test Arrow data types and arrays.
The nanoarrow libraries were built to be:
The nanoarrow Python bindings are available from PyPI and conda-forge:
pip install nanoarrow
conda install nanoarrow -c conda-forge
The nanoarrow R package is available from CRAN:
install.packages("nanoarrow")
The C library can be used by generating bundled versions of the core library and its components. This is the version used internally by the R and Python bindings.
python ci/scripts/bundle.py \
  --source-output-dir=dist \
  --include-output-dir=dist \
  --header-namespace= \
  --with-device \
  --with-ipc \
  --with-testing \
  --with-flatcc
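If the bundling step needs to be driven from another Python script, the same invocation can be assembled as an argument list. The following is only a sketch: the helper `bundle_command()` and its keyword arguments are hypothetical, and the flags are exactly those shown in the command above, nothing more.

```python
import sys
from typing import List


def bundle_command(
    output_dir: str = "dist",
    with_device: bool = True,
    with_ipc: bool = True,
    with_testing: bool = True,
    with_flatcc: bool = True,
) -> List[str]:
    """Build the argv for ci/scripts/bundle.py (hypothetical helper)."""
    cmd = [
        sys.executable,
        "ci/scripts/bundle.py",
        f"--source-output-dir={output_dir}",
        f"--include-output-dir={output_dir}",
        "--header-namespace=",
    ]
    # Optional components, mirroring the flags in the README command.
    if with_device:
        cmd.append("--with-device")
    if with_ipc:
        cmd.append("--with-ipc")
    if with_testing:
        cmd.append("--with-testing")
    if with_flatcc:
        cmd.append("--with-flatcc")
    return cmd


if __name__ == "__main__":
    # Print the command; pass it to subprocess.run(cmd, check=True)
    # from the repository root to actually run the bundler.
    print(" ".join(bundle_command()))
```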
CMake is also supported via a build/install with `find_package()` or using `FetchContent`:
fetchcontent_declare(nanoarrow
                     URL "https://www.apache.org/dyn/closer.lua?action=download&filename=arrow/nanoarrow-0.5.0/apache-arrow-0.5.0.tar.gz")
fetchcontent_makeavailable(nanoarrow)
The C library can also be used as a Meson subproject installed with:
mkdir subprojects
meson wrap install nanoarrow
...and declared as a dependency with:
nanoarrow_dep = dependency('nanoarrow')
example_exec = executable('example_meson_minimal_app',
                          'src/app.cc',
                          dependencies: [nanoarrow_dep])
See the nanoarrow Documentation for extended tutorials and API reference for the C, C++, Python, and R libraries.
The nanoarrow GitHub repository additionally provides a number of examples covering how to use nanoarrow in a variety of build configurations.
CMake is the primary build system used to develop and test the nanoarrow C library. You can build nanoarrow with:
mkdir build && cd build
cmake ..
cmake --build .
To build nanoarrow along with tests run:
mkdir build && cd build
cmake .. -DNANOARROW_BUILD_TESTS=ON
cmake --build .
If you are able to install Arrow C++ you can enable more testing:
mkdir build && cd build
cmake .. -DNANOARROW_BUILD_TESTS=ON -DNANOARROW_BUILD_TESTS_WITH_ARROW=ON
cmake --build .
Tests can be run with `ctest`.
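For scripted builds (for example, in CI), the configure, build, and test steps above could be collected into one small helper. This is a sketch under stated assumptions: `cmake_steps()` and `run_steps()` are hypothetical names, the out-of-source form `cmake -S . -B build` requires CMake 3.13 or newer, and `ctest --test-dir` requires CMake 3.20 or newer. The two `-D` options are the ones shown above.

```python
import subprocess
from typing import List


def cmake_steps(
    build_dir: str = "build",
    tests: bool = False,
    tests_with_arrow: bool = False,
) -> List[List[str]]:
    """Return the configure/build/test commands as argv lists (hypothetical helper)."""
    build_tests = tests or tests_with_arrow
    configure = ["cmake", "-S", ".", "-B", build_dir]
    if build_tests:
        configure.append("-DNANOARROW_BUILD_TESTS=ON")
    if tests_with_arrow:
        # Needs an Arrow C++ installation that CMake can find.
        configure.append("-DNANOARROW_BUILD_TESTS_WITH_ARROW=ON")

    steps = [configure, ["cmake", "--build", build_dir]]
    if build_tests:
        steps.append(["ctest", "--test-dir", build_dir])
    return steps


def run_steps(steps: List[List[str]]) -> None:
    """Run each command in order, stopping at the first failure."""
    for cmd in steps:
        subprocess.run(cmd, check=True)


if __name__ == "__main__":
    # Print the plan; call run_steps(...) from the repository root to execute it.
    for step in cmake_steps(tests=True):
        print(" ".join(step))
```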
CMake is the officially supported build system for nanoarrow. However, the Meson backend is an experimental feature you may also wish to try.
meson setup builddir
cd builddir
After setting up your project, be sure to enable the options you want:
meson configure -Dtests=true -Dbenchmarks=true
You can enable better test coverage if Apache Arrow is installed on your system with `-Dtest_with_arrow=true`. Depending on how you have installed Apache Arrow, you may also need to pass `--pkg-config-path <path to directory with arrow.pc>`.
With the above out of the way, the `compile` command should take care of the rest:
meson compile
Upon a successful build you can execute the test suite and benchmarks with the following commands:
meson test nanoarrow:  # default test run
meson test nanoarrow: --wrap valgrind  # run tests under valgrind
meson test nanoarrow: --benchmark --verbose  # run benchmarks