feat(python): Add CArrayView -> Python conversion (#391)

This PR adds a framework for Python object creation from arrays and
array streams with implementations for most arrow types. Notably, it
includes implementations for nested types (struct, list, dictionary) to
make sure that the framework won't have to be completely rewritten to
accommodate them. A few types (decimal, datetime) aren't supported but
should be reasonably easy to implement by wrapping existing iterator
factories included in this PR.
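
As a rough illustration of what wrapping an existing iterator factory could look like for a date type, here is a minimal pure-Python sketch. It does not use the iterator module from this PR: `iter_int_storage` and `iter_date32` are hypothetical names, and the storage iterator is assumed to yield plain integers (days since the Unix epoch) or `None` for nulls.

```python
import datetime

EPOCH = datetime.date(1970, 1, 1)


def iter_int_storage(values):
    """Hypothetical stand-in for an existing integer iterator factory."""
    yield from values


def iter_date32(values):
    """Wrap the integer iterator, converting days since the epoch to dates."""
    for days in iter_int_storage(values):
        # Nulls pass through unchanged; everything else becomes a date
        yield None if days is None else EPOCH + datetime.timedelta(days=days)


dates = list(iter_date32([0, None, 365]))
```

The point of the sketch is only that the new type needs no iteration logic of its own: it reuses the storage iterator and applies a per-element conversion.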

None of this is exposed via `import nanoarrow as na` yet. I'm
anticipating that the user-facing `nanoarrow.Array` and/or
`nanoarrow.ArrayStream` will use the implementation here in their methods.

A few changes were required at a lower level to make this work:

- It is now possible to use nanoarrow's `ArrowBasicArrayStream`
implementation to create a stream from a previously-resolved list of
arrays. This makes streams easier to test, since previously there was no
way to create them.
- The constructor for `c_array_stream()` now falls back on `c_array()`
by wrapping it in a length-one stream. This makes it easier to write
generic code that takes stream-like input (like the iterator).
- The `ArrowLayout` needed to be exposed to implement fixed-size
list support.
- I added tests for all the lower-level changes in dedicated files.
Some of these tests overlap with existing tests in test_nanoarrow. At
some point we should go through test_nanoarrow and separate the tests
(or create an integration test section, since many of those early tests
assumed pyarrow was available).
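
The `c_array_stream()` fallback described in the list above can be pictured with a small pure-Python sketch. It does not call nanoarrow: `ArrayLike`, `StreamLike`, `as_stream`, and `iter_values` are hypothetical names used only to show how wrapping a single array in a length-one stream lets generic code handle both kinds of input.

```python
class ArrayLike:
    """Hypothetical stand-in for a single resolved array."""

    def __init__(self, values):
        self.values = list(values)


class StreamLike:
    """Hypothetical stand-in for a stream built from resolved arrays."""

    def __init__(self, arrays):
        self._arrays = list(arrays)

    def __iter__(self):
        return iter(self._arrays)


def as_stream(obj):
    """Return obj if it is already a stream, else a length-one stream."""
    if isinstance(obj, StreamLike):
        return obj
    return StreamLike([obj])


def iter_values(obj):
    """Generic consumer: works for a single array or a stream of arrays."""
    for array in as_stream(obj):
        yield from array.values


single = ArrayLike([1, 2, 3])
chunked = StreamLike([ArrayLike([1, 2]), ArrayLike([3])])
```

With this shape, `iter_values(single)` and `iter_values(chunked)` yield the same elements, so the consumer never needs a separate code path for a single array.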

The implementation appears to be reasonably efficient (timings below),
given the constraint that assembling the iterators is currently done
using Python code.

```python
# Timings use the IPython %timeit magic, so run this in IPython/Jupyter
import numpy as np
import pyarrow as pa
from nanoarrow import iterator

n = int(1e6)
n_cols = 10
arrays = [np.random.random(n) for _ in range(n_cols)]
batch = pa.record_batch(
    arrays,
    names=[f"col{i}" for i in range(n_cols)]
)

%timeit list(iterator.itertuples(batch))
#> 256 ms ± 4.61 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

# Just zipping the arrays
%timeit list(zip(*arrays))
#> 335 ms ± 1.15 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

# A few ways to do this from pyarrow
%timeit list(zip(*(col.to_pylist() for col in batch.columns)))
#> 1.99 s ± 52.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit list(zip(*(col.to_numpy() for col in batch.columns)))
#> 315 ms ± 1.33 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
# Works if all columns are the same type (but rows are arrays, not tuples)
%timeit list(np.array(batch))
#> 131 ms ± 484 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

# Test some nested things
n = int(1e4)
n_cols = 10
big_list = [["a", "b", "c", "d", "e"]] * n

arrays = [big_list for _ in range(n_cols)]
batch = pa.record_batch(
    arrays,
    names=[f"col{i}" for i in range(n_cols)]
)

%timeit list(iterator.itertuples(batch))
#> 89.2 ms ± 756 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

%timeit list(zip(*(col.to_pylist() for col in batch.columns)))
#> 288 ms ± 748 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)
```
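
For readers wondering what "assembling the iterators in Python" might involve, the following is a simplified pure-Python sketch, not the code from this PR. `iter_list` and `iter_rows` are hypothetical names; the sketch assumes list offsets that start at zero and ignores nulls.

```python
from itertools import islice


def iter_list(offsets, child_values):
    """Yield one Python list per element by slicing the child by offsets."""
    child = iter(child_values)
    # Assumes offsets[0] == 0, so the child iterator needs no initial skip
    for start, end in zip(offsets[:-1], offsets[1:]):
        yield list(islice(child, end - start))


def iter_rows(*column_iterators):
    """Yield row tuples by advancing every column iterator in lockstep."""
    return zip(*column_iterators)


offsets = [0, 2, 2, 5]
child = ["a", "b", "c", "d", "e"]
lists = list(iter_list(offsets, child))
rows = list(iter_rows(iter([1, 2, 3]), iter_list(offsets, child)))
```

Here the per-element work is pushed into `zip` and `islice`, which run in C, while the Python code only wires the iterators together.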
11 files changed
tree: b0c062c9135a2dbf9d66ce0953eeb3366d8611b3
README.md

nanoarrow


The nanoarrow library is a set of helper functions to interpret and generate Arrow C Data Interface and Arrow C Stream Interface structures. The library is in active early development and users should update regularly from the main branch of this repository.

Whereas the current suite of Arrow implementations provides the basis for a comprehensive data analysis toolkit, this library is intended to support clients that wish to produce or interpret Arrow C Data and/or Arrow C Stream structures where linking to a higher-level Arrow binding is difficult or impossible.

Using the C library

The nanoarrow C library is intended to be copied and vendored. This can be done using CMake or by using the bundled nanoarrow.h/nanoarrow.c distribution available in the dist/ directory in this repository. Examples of both can be found in the examples/ directory in this repository.

A simple producer example:

#include "nanoarrow.h"

int make_simple_array(struct ArrowArray* array_out, struct ArrowSchema* schema_out) {
  struct ArrowError error;
  array_out->release = NULL;
  schema_out->release = NULL;

  NANOARROW_RETURN_NOT_OK(ArrowArrayInitFromType(array_out, NANOARROW_TYPE_INT32));

  NANOARROW_RETURN_NOT_OK(ArrowArrayStartAppending(array_out));
  NANOARROW_RETURN_NOT_OK(ArrowArrayAppendInt(array_out, 1));
  NANOARROW_RETURN_NOT_OK(ArrowArrayAppendInt(array_out, 2));
  NANOARROW_RETURN_NOT_OK(ArrowArrayAppendInt(array_out, 3));
  NANOARROW_RETURN_NOT_OK(ArrowArrayFinishBuildingDefault(array_out, &error));

  NANOARROW_RETURN_NOT_OK(ArrowSchemaInitFromType(schema_out, NANOARROW_TYPE_INT32));

  return NANOARROW_OK;
}

A simple consumer example:

#include <errno.h>
#include <stdio.h>

#include "nanoarrow.h"

int print_simple_array(struct ArrowArray* array, struct ArrowSchema* schema) {
  struct ArrowError error;
  struct ArrowArrayView array_view;
  NANOARROW_RETURN_NOT_OK(ArrowArrayViewInitFromSchema(&array_view, schema, &error));

  if (array_view.storage_type != NANOARROW_TYPE_INT32) {
    printf("Array has storage that is not int32\n");
    // Release the view and stop rather than reading non-int32 storage
    ArrowArrayViewReset(&array_view);
    return EINVAL;
  }

  int result = ArrowArrayViewSetArray(&array_view, array, &error);
  if (result != NANOARROW_OK) {
    ArrowArrayViewReset(&array_view);
    return result;
  }

  for (int64_t i = 0; i < array->length; i++) {
    printf("%d\n", (int)ArrowArrayViewGetIntUnsafe(&array_view, i));
  }

  ArrowArrayViewReset(&array_view);
  return NANOARROW_OK;
}