commit	7cf50a3208c0d55f5cd230f611bef808d8683832
author	Dewey Dunnington <dewey@dunnington.ca>	Fri Mar 01 09:45:52 2024 -0400
committer	GitHub <noreply@github.com>	Fri Mar 01 09:45:52 2024 -0400
tree	b0c062c9135a2dbf9d66ce0953eeb3366d8611b3
parent	7e601cc06648fa35f69144c06135aa597b4e9def

feat(python): Add CArrayView -> Python conversion (#391)

This PR adds a framework for Python object creation from arrays and
array streams with implementations for most arrow types. Notably, it
includes implementations for nested types (struct, list, dictionary) to
make sure that the framework won't have to be completely rewritten to
accommodate them. A few types (decimal, datetime) aren't supported but
should be reasonably easy to implement by wrapping existing iterator
factories included in this PR.
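
To illustrate why nested types shape the design, here is a minimal
pure-Python sketch of composable iterator factories. It is not the code
added in this PR: the names `iter_primitive`, `iter_list`, and
`iter_struct` are hypothetical, and plain Python lists stand in for Arrow
buffers. The point is only that a nested iterator can be assembled by
wrapping child iterators.

```python
from itertools import islice


def iter_primitive(values, validity=None):
    """Yield each value, or None where the (optional) validity flag is False."""
    if validity is None:
        yield from values
    else:
        for value, is_valid in zip(values, validity):
            yield value if is_valid else None


def iter_list(offsets, child_iter):
    """Yield one Python list per element by consuming a child iterator.

    offsets has n + 1 entries, as in the Arrow list layout. This sketch
    assumes offsets[0] == 0 and no nulls at the list level.
    """
    for start, end in zip(offsets[:-1], offsets[1:]):
        yield list(islice(child_iter, end - start))


def iter_struct(child_iters):
    """Yield one tuple per row by zipping the child iterators together."""
    return zip(*child_iters)


# Hypothetical data: a struct column with an int field and a list<str> field
ints = iter_primitive([1, 2, 3], validity=[True, False, True])
lists = iter_list([0, 2, 2, 5], iter_primitive(["a", "b", "c", "d", "e"]))
rows = list(iter_struct([ints, lists]))
```

Because each factory only consumes other iterators, a type such as
decimal or datetime could in principle be added by wrapping a primitive
iterator with a conversion step.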

None of these are exposed with `import nanoarrow as na` yet. I'm
anticipating that the user-facing `nanoarrow.Array` and/or
`nanoarrow.ArrayStream` will use the implementation here in their methods.

A few changes were required at a lower level to make this work:

- It is now possible to use nanoarrow's `ArrowBasicArrayStream`
implementation to create a stream from a previously-resolved list of
arrays. This makes it easier to test streams, since there was previously
no way to create them.
- The constructor for `c_array_stream()` now falls back on `c_array()`
by wrapping it in a length-one stream. This makes it easier to write
generic code that takes stream-like input (like the iterator).
- The `ArrowLayout` needed to be exposed to implement fixed-size list
support.
- I added tests for all the lower-level changes in dedicated files. Some
of these tests overlap with existing tests in test_nanoarrow. At some
point we should go through test_nanoarrow and separate the tests (or
create an integration test section, since many of those early tests
assumed pyarrow was available).
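
The stream fallback described in the list above can be pictured with a
small stand-in. This sketch does not call nanoarrow: `FakeArray`,
`FakeStream`, `as_stream`, and `iter_values` are hypothetical names used
only for illustration. It shows why normalizing a single array to a
length-one stream lets generic code handle one input shape.

```python
class FakeArray:
    """Hypothetical stand-in for a single Arrow array."""

    def __init__(self, values):
        self.values = list(values)


class FakeStream:
    """Hypothetical stand-in for a stream: an ordered sequence of arrays."""

    def __init__(self, arrays):
        self.arrays = list(arrays)


def as_stream(obj):
    """Return obj if it is already a stream, else wrap it in a length-one stream."""
    if isinstance(obj, FakeStream):
        return obj
    return FakeStream([obj])


def iter_values(obj):
    """Generic consumer that only needs to know how to walk a stream."""
    for array in as_stream(obj).arrays:
        yield from array.values


single = FakeArray([1, 2, 3])
chunked = FakeStream([FakeArray([1, 2]), FakeArray([3])])
from_array = list(iter_values(single))
from_stream = list(iter_values(chunked))
```

Both inputs produce the same values, so the consumer never has to branch
on whether it was given an array or a stream.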

The implementation seems to be efficient given the constraint that
assembling the iterators is currently done using Python code.

```python
import numpy as np
import pyarrow as pa
from nanoarrow import iterator

n = int(1e6)
n_cols = 10
arrays = [np.random.random(n) for _ in range(n_cols)]
batch = pa.record_batch(
    arrays,
    names=[f"col{i}" for i in range(n_cols)]
)

%timeit list(iterator.itertuples(batch))
#> 256 ms ± 4.61 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

# Just zipping the arrays
%timeit list(zip(*arrays))
#> 335 ms ± 1.15 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

# A few ways to do this from pyarrow
%timeit list(zip(*(col.to_pylist() for col in batch.columns)))
#> 1.99 s ± 52.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit list(zip(*(col.to_numpy() for col in batch.columns)))
#> 315 ms ± 1.33 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
# Works if all columns are the same type (but rows are arrays, not tuples)
%timeit list(np.array(batch))
#> 131 ms ± 484 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

# Test some nested things
n = int(1e4)
n_cols = 10
big_list = [["a", "b", "c", "d", "e"]] * n

arrays = [big_list for _ in range(n_cols)]
batch = pa.record_batch(
    arrays,
    names=[f"col{i}" for i in range(n_cols)]
)

%timeit list(iterator.itertuples(batch))
#> 89.2 ms ± 756 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

%timeit list(zip(*(col.to_pylist() for col in batch.columns)))
#> 288 ms ± 748 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)
```

11 files changed

README.md

nanoarrow

The nanoarrow library is a set of helper functions to interpret and generate Arrow C Data Interface and Arrow C Stream Interface structures. The library is in active early development and users should update regularly from the main branch of this repository.

Whereas the current suite of Arrow implementations provide the basis for a comprehensive data analysis toolkit, this library is intended to support clients that wish to produce or interpret Arrow C Data and/or Arrow C Stream structures where linking to a higher level Arrow binding is difficult or impossible.

Using the C library

The nanoarrow C library is intended to be copied and vendored. This can be done using CMake or by using the bundled nanoarrow.h/nanoarrow.c distribution available in the dist/ directory in this repository. Examples of both can be found in the examples/ directory in this repository.

A simple producer example:

#include "nanoarrow.h"

int make_simple_array(struct ArrowArray* array_out, struct ArrowSchema* schema_out) {
  struct ArrowError error;
  array_out->release = NULL;
  schema_out->release = NULL;

  NANOARROW_RETURN_NOT_OK(ArrowArrayInitFromType(array_out, NANOARROW_TYPE_INT32));

  NANOARROW_RETURN_NOT_OK(ArrowArrayStartAppending(array_out));
  NANOARROW_RETURN_NOT_OK(ArrowArrayAppendInt(array_out, 1));
  NANOARROW_RETURN_NOT_OK(ArrowArrayAppendInt(array_out, 2));
  NANOARROW_RETURN_NOT_OK(ArrowArrayAppendInt(array_out, 3));
  NANOARROW_RETURN_NOT_OK(ArrowArrayFinishBuildingDefault(array_out, &error));

  NANOARROW_RETURN_NOT_OK(ArrowSchemaInitFromType(schema_out, NANOARROW_TYPE_INT32));

  return NANOARROW_OK;
}

A simple consumer example:

#include <errno.h>
#include <stdio.h>

#include "nanoarrow.h"

int print_simple_array(struct ArrowArray* array, struct ArrowSchema* schema) {
  struct ArrowError error;
  struct ArrowArrayView array_view;
  NANOARROW_RETURN_NOT_OK(ArrowArrayViewInitFromSchema(&array_view, schema, &error));

  // Bail out early: the loop below reads every value as an int32
  if (array_view.storage_type != NANOARROW_TYPE_INT32) {
    printf("Array has storage that is not int32\n");
    ArrowArrayViewReset(&array_view);
    return EINVAL;
  }

  int result = ArrowArrayViewSetArray(&array_view, array, &error);
  if (result != NANOARROW_OK) {
    ArrowArrayViewReset(&array_view);
    return result;
  }

  for (int64_t i = 0; i < array->length; i++) {
    printf("%d\n", (int)ArrowArrayViewGetIntUnsafe(&array_view, i));
  }

  ArrowArrayViewReset(&array_view);
  return NANOARROW_OK;
}