commit | 72a2e67bb3476a926026f128b622e89881204ae9 | |
---|---|---|
author | Dewey Dunnington <dewey@dunnington.ca> | Thu Jan 11 19:17:24 2024 +0000 |
committer | GitHub <noreply@github.com> | Thu Jan 11 15:17:24 2024 -0400 |
tree | f4304f01896064e57103e4d23b362c72e2e29557 | |
parent | b636a8f88d6a529ff31da6847ac44b584847e5f6 | |
refactor(python): Document, prefix, and add reprs for C-wrapping classes (#340)

This PR was inspired by #319 but only addresses the first half: it prefixes C-wrapping classes so that the name `nanoarrow.array()` can be used for a future class/constructor that more closely resembles a `pyarrow.Array` or `numpy.Array`.

This PR does a few things:

- Uses capsules to manage allocation/cleanup of C resources instead of "holder" objects. This eliminated some code and in theory makes it possible to move some pieces out of Cython into C.
- Renames any "nanoarrow C library binding" classes to start with `C` (e.g., `Schema` to `CSchema`). I made them slightly more literal as well. Basically, these classes are about accessing the fields of the structure without segfaulting. In a potential future world where we don't use Cython, this is something like what we'd get with auto-generated wrapper classes or thin C++ wrappers with generated binding code.
- Opens the door for the user-facing versions of these: `Array`, `Schema`, and an `ArrayStream`. The scope and design of those require more iteration than this PR allows and would benefit from some other infrastructure being in place first (e.g., convert to/from Python).

To make it a little clearer what the existing structures actually are and what they can do, I added `repr()`s for them and updated the README.

Briefly:

```python
import nanoarrow as na
import pyarrow as pa

na.cschema(pa.int32())
#> <nanoarrow.clib.CSchema int32>
#> - format: 'i'
#> - name: ''
#> - flags: 2
#> - metadata: NULL
#> - dictionary: NULL
#> - children[0]:

na.cschema_view(pa.timestamp('s', "America/Halifax"))
#> <nanoarrow.clib.CSchemaView>
#> - type: 'timestamp'
#> - storage_type: 'int64'
#> - time_unit: 's'
#> - timezone: 'America/Halifax'

na.carray(pa.array([1, 2, 3]))
#> <nanoarrow.clib.CArray int64>
#> - length: 3
#> - offset: 0
#> - null_count: 0
#> - buffers: (0, 3354772373824)
#> - dictionary: NULL
#> - children[0]:

na.carray_view(pa.array([1, 2, 3]))
#> <nanoarrow.clib.CArrayView>
#> - storage_type: 'int64'
#> - length: 3
#> - offset: 0
#> - null_count: 0
#> - buffers[2]:
#>   - <bool validity[0 b] >
#>   - <int64 data[24 b] 1 2 3>
#> - dictionary: NULL
#> - children[0]:

pa_array_child = pa.array([1, 2, 3], pa.int32())
pa_array = pa.record_batch([pa_array_child], names=["some_column"])
reader = pa.RecordBatchReader.from_batches(pa_array.schema, [pa_array])
na.carray_stream(reader)
#> <nanoarrow.clib.CArrayStream>
#> - get_schema(): struct<some_column: int32>
```

This involved fixing the existing `BufferView`, since printing its contents in a repr-friendly way requires accessing the elements. I think the `BufferView` will see some changes but it does seem relatively performant:

```python
import pyarrow as pa
import nanoarrow as na
import numpy as np

n = int(1e6)
pa_array = pa.array(np.random.random(n))
na_array_view = na.carray_view(pa_array)

%timeit pa_array.to_pylist()
#> 169 ms ± 1.35 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

%timeit list(na_array_view.buffer(1))
#> 33.8 ms ± 340 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
```

---------

Co-authored-by: Dane Pitkin <48041712+danepitkin@users.noreply.github.com>
The nanoarrow library is a set of helper functions to interpret and generate Arrow C Data Interface and Arrow C Stream Interface structures. The library is in active early development and users should update regularly from the main branch of this repository.
Whereas the current suite of Arrow implementations provides the basis for a comprehensive data analysis toolkit, this library is intended to support clients that wish to produce or interpret Arrow C Data and/or Arrow C Stream structures where linking to a higher-level Arrow binding is difficult or impossible.
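To make concrete what an Arrow C Data Interface structure looks like, the sketch below fills in a `struct ArrowSchema` for a nullable int32 column by hand, without using nanoarrow. The struct layout and flag value follow the Arrow C Data Interface specification (nanoarrow.h already provides them, so they are repeated here only to keep the sketch standalone). The helper names `init_int32_schema_by_hand()` and `release_static_schema()` are hypothetical and exist only for this illustration. In practice, a call such as `ArrowSchemaInitFromType()` would replace this manual bookkeeping.

```c
#include <stddef.h>
#include <stdint.h>

/* Layout as published in the Arrow C Data Interface specification.
 * nanoarrow.h provides this definition; it is guarded so that the sketch
 * can also be compiled alongside a header that already defines it. */
#ifndef ARROW_C_DATA_INTERFACE
#define ARROW_C_DATA_INTERFACE

#define ARROW_FLAG_NULLABLE 2

struct ArrowSchema {
  const char* format;
  const char* name;
  const char* metadata;
  int64_t flags;
  int64_t n_children;
  struct ArrowSchema** children;
  struct ArrowSchema* dictionary;
  void (*release)(struct ArrowSchema*);
  void* private_data;
};

#endif

/* Hypothetical helper: release callback for a schema that owns no heap
 * memory (all of its strings are string literals). */
void release_static_schema(struct ArrowSchema* schema) {
  /* The specification requires the callback to mark the structure as
   * released by setting the release member to NULL. */
  schema->release = NULL;
}

/* Hypothetical helper: populate a nullable int32 schema field by field. */
void init_int32_schema_by_hand(struct ArrowSchema* schema) {
  schema->format = "i"; /* format string for int32 */
  schema->name = "";
  schema->metadata = NULL;
  schema->flags = ARROW_FLAG_NULLABLE;
  schema->n_children = 0;
  schema->children = NULL;
  schema->dictionary = NULL;
  schema->release = &release_static_schema;
  schema->private_data = NULL;
}
```

A caller would pass a `struct ArrowSchema` to `init_int32_schema_by_hand()`, read its fields, and finally call `schema.release(&schema)`. Afterwards `schema.release` is `NULL`, which is how a consumer recognizes a released structure.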
The nanoarrow C library is intended to be copied and vendored. This can be done using CMake or by using the bundled nanoarrow.h/nanoarrow.c distribution available in the dist/ directory in this repository. Examples of both can be found in the examples/ directory in this repository.
A simple producer example:
```c
#include "nanoarrow.h"

int make_simple_array(struct ArrowArray* array_out, struct ArrowSchema* schema_out) {
  struct ArrowError error;
  array_out->release = NULL;
  schema_out->release = NULL;

  NANOARROW_RETURN_NOT_OK(ArrowArrayInitFromType(array_out, NANOARROW_TYPE_INT32));

  NANOARROW_RETURN_NOT_OK(ArrowArrayStartAppending(array_out));
  NANOARROW_RETURN_NOT_OK(ArrowArrayAppendInt(array_out, 1));
  NANOARROW_RETURN_NOT_OK(ArrowArrayAppendInt(array_out, 2));
  NANOARROW_RETURN_NOT_OK(ArrowArrayAppendInt(array_out, 3));
  NANOARROW_RETURN_NOT_OK(ArrowArrayFinishBuildingDefault(array_out, &error));

  NANOARROW_RETURN_NOT_OK(ArrowSchemaInitFromType(schema_out, NANOARROW_TYPE_INT32));

  return NANOARROW_OK;
}
```
A simple consumer example:
```c
#include <errno.h>
#include <stdio.h>

#include "nanoarrow.h"

int print_simple_array(struct ArrowArray* array, struct ArrowSchema* schema) {
  struct ArrowError error;
  struct ArrowArrayView array_view;
  NANOARROW_RETURN_NOT_OK(ArrowArrayViewInitFromSchema(&array_view, schema, &error));

  // Bail out if the storage type is not int32 instead of continuing to print.
  if (array_view.storage_type != NANOARROW_TYPE_INT32) {
    printf("Array has storage that is not int32\n");
    ArrowArrayViewReset(&array_view);
    return EINVAL;
  }

  // Point the view at the array; release the view if this fails.
  int result = ArrowArrayViewSetArray(&array_view, array, &error);
  if (result != NANOARROW_OK) {
    ArrowArrayViewReset(&array_view);
    return result;
  }

  for (int64_t i = 0; i < array->length; i++) {
    printf("%d\n", (int)ArrowArrayViewGetIntUnsafe(&array_view, i));
  }

  ArrowArrayViewReset(&array_view);
  return NANOARROW_OK;
}
```