commit | 7cf50a3208c0d55f5cd230f611bef808d8683832 | |
---|---|---|
author | Dewey Dunnington <dewey@dunnington.ca> | Fri Mar 01 09:45:52 2024 -0400 |
committer | GitHub <noreply@github.com> | Fri Mar 01 09:45:52 2024 -0400 |
tree | b0c062c9135a2dbf9d66ce0953eeb3366d8611b3 | |
parent | 7e601cc06648fa35f69144c06135aa597b4e9def | |
feat(python): Add CArrayView -> Python conversion (#391)

This PR adds a framework for Python object creation from arrays and array streams, with implementations for most Arrow types. Notably, it includes implementations for nested types (struct, list, dictionary) to make sure that the framework won't have to be completely rewritten to accommodate them. A few types (decimal, datetime) aren't supported but should be reasonably easy to implement by wrapping existing iterator factories included in this PR.

None of these are exposed with `import nanoarrow as na` yet. I'm anticipating that the user-facing `nanoarrow.Array` and/or `nanoarrow.ArrayStream` will use the implementation here in their methods.

A few changes were required at a lower level to make this work:

- It is now possible to use nanoarrow's `ArrowBasicArrayStream` implementation to create a stream from a previously-resolved list of arrays. This makes it easier to test streams, since before we had no way to create them.
- The constructor for `c_array_stream()` now falls back on `c_array()` by wrapping it in a length-one stream. This makes it easier to write generic code that takes stream-like input (like the iterator).
- The `ArrowLayout` needed to be exposed to support the fixed-size list implementation.
- I added tests for all the lower level changes, which I did in dedicated files. Some of these tests overlap with existing tests in test_nanoarrow. At some point we should go through test_nanoarrow and separate the tests (or create an integration test section, since many of those early tests assumed pyarrow was available).

The implementation seems to be efficient given the constraint that assembling the iterators is currently done using Python code:
```python
import numpy as np
import pyarrow as pa
from nanoarrow import iterator

n = int(1e6)
n_cols = 10
arrays = [np.random.random(n) for _ in range(n_cols)]
batch = pa.record_batch(
    arrays,
    names=[f"col{i}" for i in range(n_cols)]
)

%timeit list(iterator.itertuples(batch))
#> 256 ms ± 4.61 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

# Just zipping the arrays
%timeit list(zip(*arrays))
#> 335 ms ± 1.15 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

# A few ways to do this from pyarrow
%timeit list(zip(*(col.to_pylist() for col in batch.columns)))
#> 1.99 s ± 52.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

%timeit list(zip(*(col.to_numpy() for col in batch.columns)))
#> 315 ms ± 1.33 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

# Works if all columns are the same type (but rows are arrays, not tuples)
%timeit list(np.array(batch))
#> 131 ms ± 484 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

# Test some nested things
n = int(1e4)
n_cols = 10
big_list = [["a", "b", "c", "d", "e"]] * n
arrays = [big_list for _ in range(n_cols)]
batch = pa.record_batch(
    arrays,
    names=[f"col{i}" for i in range(n_cols)]
)

%timeit list(iterator.itertuples(batch))
#> 89.2 ms ± 756 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

%timeit list(zip(*(col.to_pylist() for col in batch.columns)))
#> 288 ms ± 748 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)
```
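To illustrate what "assembling the iterators using Python code" could look like for nested types, here is a minimal pure-Python sketch. It is not the code from this PR: every function name below (`iter_primitive`, `iter_struct`, `iter_list`, `iter_dictionary`, `as_stream`, `sketch_itertuples`) is hypothetical, plain Python lists stand in for Arrow buffers, and a `dict` of columns stands in for a record batch. The idea it shows is that each type gets a small iterator factory and nested types are built by wrapping the iterators of their children.

```python
from itertools import islice


def iter_primitive(values, validity=None):
    """Yield scalars, or None where the (optional) validity mask is False."""
    if validity is None:
        yield from values
    else:
        for value, is_valid in zip(values, validity):
            yield value if is_valid else None


def iter_struct(child_iterators):
    """Struct rows become tuples: zip the child iterators together."""
    return zip(*child_iterators)


def iter_list(offsets, child_iterator):
    """Variable-size list: consecutive offsets delimit each element's items."""
    child = iter(child_iterator)
    # Skip child items that precede the first offset (assumed to be sliced off)
    for _ in islice(child, offsets[0]):
        pass
    for start, end in zip(offsets[:-1], offsets[1:]):
        yield list(islice(child, end - start))


def iter_dictionary(indices, dictionary_values):
    """Dictionary-encoded values: look each index up in the dictionary."""
    lookup = list(dictionary_values)
    for index in indices:
        yield None if index is None else lookup[index]


def as_stream(obj):
    """Hypothetical stand-in for the fallback described above:
    a single batch is wrapped in a length-one stream."""
    return [obj] if isinstance(obj, dict) else obj


def sketch_itertuples(obj):
    """Yield one tuple per row from a batch or from a stream of batches."""
    for batch in as_stream(obj):
        yield from iter_struct([iter_primitive(col) for col in batch.values()])
```

Because `sketch_itertuples` always goes through `as_stream`, the same generic code accepts either a single batch or a list of batches, which is the convenience the length-one stream fallback is meant to provide.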
The nanoarrow library is a set of helper functions to interpret and generate Arrow C Data Interface and Arrow C Stream Interface structures. The library is in active early development and users should update regularly from the main branch of this repository.
Whereas the current suite of Arrow implementations provides the basis for a comprehensive data analysis toolkit, this library is intended to support clients that wish to produce or interpret Arrow C Data and/or Arrow C Stream structures where linking to a higher level Arrow binding is difficult or impossible.
The nanoarrow C library is intended to be copied and vendored. This can be done using CMake or by using the bundled nanoarrow.h/nanoarrow.c distribution available in the dist/ directory in this repository. Examples of both can be found in the examples/ directory in this repository.
A simple producer example:

    #include "nanoarrow.h"

    int make_simple_array(struct ArrowArray* array_out, struct ArrowSchema* schema_out) {
      struct ArrowError error;

      // Mark both outputs as released until they are successfully initialized
      array_out->release = NULL;
      schema_out->release = NULL;

      // Build an int32 array containing the values 1, 2, 3
      NANOARROW_RETURN_NOT_OK(ArrowArrayInitFromType(array_out, NANOARROW_TYPE_INT32));
      NANOARROW_RETURN_NOT_OK(ArrowArrayStartAppending(array_out));
      NANOARROW_RETURN_NOT_OK(ArrowArrayAppendInt(array_out, 1));
      NANOARROW_RETURN_NOT_OK(ArrowArrayAppendInt(array_out, 2));
      NANOARROW_RETURN_NOT_OK(ArrowArrayAppendInt(array_out, 3));
      NANOARROW_RETURN_NOT_OK(ArrowArrayFinishBuildingDefault(array_out, &error));

      // Describe the array's type in the accompanying schema
      NANOARROW_RETURN_NOT_OK(ArrowSchemaInitFromType(schema_out, NANOARROW_TYPE_INT32));

      return NANOARROW_OK;
    }
A simple consumer example:

    #include <errno.h>
    #include <stdio.h>

    #include "nanoarrow.h"

    int print_simple_array(struct ArrowArray* array, struct ArrowSchema* schema) {
      struct ArrowError error;
      struct ArrowArrayView array_view;
      NANOARROW_RETURN_NOT_OK(ArrowArrayViewInitFromSchema(&array_view, schema, &error));

      // Stop early instead of printing values from an unexpected storage type
      if (array_view.storage_type != NANOARROW_TYPE_INT32) {
        printf("Array has storage that is not int32\n");
        ArrowArrayViewReset(&array_view);
        return EINVAL;
      }

      // Point the view at the array; release the view on failure
      int result = ArrowArrayViewSetArray(&array_view, array, &error);
      if (result != NANOARROW_OK) {
        ArrowArrayViewReset(&array_view);
        return result;
      }

      for (int64_t i = 0; i < array->length; i++) {
        printf("%d\n", (int)ArrowArrayViewGetIntUnsafe(&array_view, i));
      }

      ArrowArrayViewReset(&array_view);
      return NANOARROW_OK;
    }