commit | 841c845547d7abb7eb11aa01a651175309293c03 | |
---|---|---|
author | Dewey Dunnington <dewey@dunnington.ca> | Mon Feb 19 15:01:38 2024 -0400 |
committer | GitHub <noreply@github.com> | Mon Feb 19 15:01:38 2024 -0400 |
tree | 71345efb18a32f5e7073983f495d6a2d5a0d1bf4 | |
parent | 4b6717fc7e0161a366207ead5e9a30fbeca7fada | |
feat(python): Add array creation/building from buffers (#378)

The gist of this PR is that I'd like the ability to create arrays for testing without pyarrow so that nanoarrow's tests can run in more places. Other than building/running in odd corner-case environments, nanoarrow in R has been great at prototyping and/or creating test data (e.g., an array with a non-zero offset, an array with a rarely-used type). This is useful for both nanoarrow to test itself and perhaps others who might want to use nanoarrow in a similar way in Python.

This is a bit big...I did need to put all of it in one place to figure out what the end point was; however, I'm happy to split into smaller self-contained bits now that I know where I'm headed.

After this PR, we can create an array out-of-the-box from anything that supports the buffer protocol. Importantly, this includes numpy arrays so that you can do things like generate arrays with `n` random numbers.

```python
import nanoarrow as na
import numpy as np
```

```python
na.c_array_view(b"12345")
```

    <nanoarrow.c_lib.CArrayView>
    - storage_type: 'uint8'
    - length: 5
    - offset: 0
    - null_count: 0
    - buffers[2]:
      - validity <bool[0 b] >
      - data <uint8[5 b] 49 50 51 52 53>
    - dictionary: NULL
    - children[0]:

```python
na.c_array_view(np.array([1, 2, 3], np.int32))
```

```
<nanoarrow.c_lib.CArrayView>
- storage_type: 'int32'
- length: 3
- offset: 0
- null_count: 0
- buffers[2]:
  - validity <bool[0 b] >
  - data <int32[12 b] 1 2 3>
- dictionary: NULL
- children[0]:
```

While not built in to the main `c_array()` constructor, we can also now assemble an array from buffers. This has been very useful in R and ensures that we can construct just about any array if we need to.
```python
array = na.c_array_from_buffers(
    na.struct([na.int32()]),
    length=3,
    buffers=[None],
    children=[
        na.c_array_from_buffers(
            na.int32(),
            length=3,
            buffers=[None, na.c_buffer([1, 2, 3], na.int32())]
        )
    ],
)
na.c_array_view(array)
```

    <nanoarrow.c_lib.CArrayView>
    - storage_type: 'struct'
    - length: 3
    - offset: 0
    - null_count: 0
    - buffers[1]:
      - validity <bool[0 b] >
    - dictionary: NULL
    - children[1]:
      - <nanoarrow.c_lib.CArrayView>
        - storage_type: 'int32'
        - length: 3
        - offset: 0
        - null_count: 0
        - buffers[2]:
          - validity <bool[0 b] >
          - data <int32[12 b] 1 2 3>
        - dictionary: NULL
        - children[0]:

I also added the ability to construct a buffer from an iterable and wired that into the `c_array()` constructor, although this is probably not all that fast. It does, however, make it much easier to write tests (because many of them currently start with `na.c_array(pa.array([1, 2, 3]))`).

```python
na.c_array_view([1, 2, 3], na.int32())
```

    <nanoarrow.c_lib.CArrayView>
    - storage_type: 'int32'
    - length: 3
    - offset: 0
    - null_count: 0
    - buffers[2]:
      - validity <bool[0 b] >
      - data <int32[12 b] 1 2 3>
    - dictionary: NULL
    - children[0]:

This allows creating an array from anything supported by the `struct` module, which means we can create some of the less frequently used types.

```python
na.c_buffer([1, 2, 3], na.float16())
```

    CBuffer(half_float[6 b] 1.0 2.0 3.0)

```python
na.c_buffer([(1, 2), (3, 4), (5, 6)], na.interval_day_time())
```

    CBuffer(interval_day_time[24 b] (1, 2) (3, 4) (5, 6))

Because it's mentally exhausting to bitpack buffers in my head and because Arrow uses them all the time, I also think it's mission-critical to be able to create bitmaps:

```python
na.c_buffer([True, False, True, True], na.bool())
```

    CBuffer(bool[1 b] 10110000)

This involved fixing some issues with the existing buffer view:

- The buffer view only ever saved a pointer to the device.
  This is a bit of a problem because even though the CPU device is static and lives forever, CUDA "device" objects will probably keep a CUDA context alive. Thus, we need a strong reference to the `CDevice` Python object (which ensures the underlying nanoarrow `Device*` remains valid).
- The buffer view only handled `BufferView` input where technically all it needs is a pointer and a length. This opens it up to represent other types of buffers than just something from nanoarrow (e.g., imported from dlpack or buffer protocol).

Implementing the buffer protocol as a consumer was done by wrapping the `ArrowBuffer` with a "deallocator" that holds the `Py_buffer` and ensures it is released. I still need to do some testing to ensure that it's actually released and that we're not leaking memory. This is how I do it in R and in geoarrow-c (Python) as well. Using the `ArrowBuffer` is helpful because the C-level array builder uses them to manage the memory and ensures they're all released when the array is released.

Implementing the build-from-iterable involved a few more things...notably, completing the "python struct format string" <-> "arrow data type" conversion. This allows the use of `struct.pack()` which takes care of things like half-float conversion and tuples of day, month, nano conversion.

I'm aware this could use a bit better documentation of the added classes/methods...I am assuming these will be internal for the time being but they definitely need a bit more than is currently there.

---------

Co-authored-by: Joris Van den Bossche <jorisvandenbossche@gmail.com>
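As an illustration of the two buffer-building ideas in the commit message above (bit-packed bitmaps and `struct.pack()`-based conversion of iterables), here is a minimal standard-library sketch. It is not nanoarrow's implementation: the helper names `pack_bitmap`, `pack_items`, and `bits_lsb_first` are hypothetical, and the per-element format strings (`"i"` for int32, `"e"` for half float, `"ii"` for a day/time interval) are assumptions about how such types could map onto `struct` formats.

```python
import struct


def pack_bitmap(values):
    """Pack booleans into an Arrow-style bitmap (least-significant bit first)."""
    out = bytearray((len(values) + 7) // 8)
    for i, value in enumerate(values):
        if value:
            out[i // 8] |= 1 << (i % 8)
    return bytes(out)


def bits_lsb_first(byte):
    """Render one byte as a string of bits, least-significant bit first."""
    return "".join(str((byte >> i) & 1) for i in range(8))


def pack_items(item_format, items):
    """Pack an iterable into one contiguous little-endian buffer.

    item_format is a struct format for ONE element (hypothetical mapping),
    e.g. "i" for int32, "e" for half float, "ii" for a pair of int32 values.
    """
    packer = struct.Struct("<" + item_format)
    out = bytearray()
    for item in items:
        # Wrap scalars so that scalars and tuples share one code path
        args = item if isinstance(item, tuple) else (item,)
        out += packer.pack(*args)
    return bytes(out)


bitmap = pack_bitmap([True, False, True, True])
int32_data = pack_items("i", [1, 2, 3])
half_data = pack_items("e", [1.0, 2.0, 3.0])
day_time_data = pack_items("ii", [(1, 2), (3, 4), (5, 6)])

print(bits_lsb_first(bitmap[0]))
print(len(int32_data), len(half_data), len(day_time_data))
```

Under these assumptions the byte counts line up with the sizes shown in the outputs above (12, 6, and 24 bytes), and rendering the bitmap byte least-significant bit first gives the same `10110000` pattern. Packing element by element is simple but slow, which is consistent with the note that the build-from-iterable path is probably not all that fast.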
The nanoarrow library is a set of helper functions to interpret and generate Arrow C Data Interface and Arrow C Stream Interface structures. The library is in active early development and users should update regularly from the main branch of this repository.
Whereas the current suite of Arrow implementations provides the basis for a comprehensive data analysis toolkit, this library is intended to support clients that wish to produce or interpret Arrow C Data and/or Arrow C Stream structures where linking to a higher-level Arrow binding is difficult or impossible.
The nanoarrow C library is intended to be copied and vendored. This can be done using CMake or by using the bundled nanoarrow.h/nanoarrow.c distribution available in the dist/ directory in this repository. Examples of both can be found in the examples/ directory in this repository.
A simple producer example:
    #include "nanoarrow.h"

    int make_simple_array(struct ArrowArray* array_out, struct ArrowSchema* schema_out) {
      struct ArrowError error;

      // Mark both outputs as released until they are successfully populated
      array_out->release = NULL;
      schema_out->release = NULL;

      NANOARROW_RETURN_NOT_OK(ArrowArrayInitFromType(array_out, NANOARROW_TYPE_INT32));

      // Append three int32 values
      NANOARROW_RETURN_NOT_OK(ArrowArrayStartAppending(array_out));
      NANOARROW_RETURN_NOT_OK(ArrowArrayAppendInt(array_out, 1));
      NANOARROW_RETURN_NOT_OK(ArrowArrayAppendInt(array_out, 2));
      NANOARROW_RETURN_NOT_OK(ArrowArrayAppendInt(array_out, 3));

      // Finalize the array's buffers and run the default validation
      NANOARROW_RETURN_NOT_OK(ArrowArrayFinishBuildingDefault(array_out, &error));

      NANOARROW_RETURN_NOT_OK(ArrowSchemaInitFromType(schema_out, NANOARROW_TYPE_INT32));
      return NANOARROW_OK;
    }
A simple consumer example:
    #include <stdio.h>

    #include "nanoarrow.h"

    int print_simple_array(struct ArrowArray* array, struct ArrowSchema* schema) {
      struct ArrowError error;
      struct ArrowArrayView array_view;

      // Initialize the view from the schema so it knows the expected layout
      NANOARROW_RETURN_NOT_OK(ArrowArrayViewInitFromSchema(&array_view, schema, &error));

      if (array_view.storage_type != NANOARROW_TYPE_INT32) {
        printf("Array has storage that is not int32\n");
      }

      // Point the view at the array; reset the view if this fails
      int result = ArrowArrayViewSetArray(&array_view, array, &error);
      if (result != NANOARROW_OK) {
        ArrowArrayViewReset(&array_view);
        return result;
      }

      // Print each element (no null or bounds checks are performed)
      for (int64_t i = 0; i < array->length; i++) {
        printf("%d\n", (int)ArrowArrayViewGetIntUnsafe(&array_view, i));
      }

      ArrowArrayViewReset(&array_view);
      return NANOARROW_OK;
    }