commit	93b127327561aa74ae91fe5e690ced59e1276f11
author	Dewey Dunnington <dewey@dunnington.ca>	Tue Apr 23 20:58:52 2024 -0300
committer	GitHub <noreply@github.com>	Tue Apr 23 20:58:52 2024 -0300
tree	c6af5d13333c0aad0e687720af1f36a54b1fa04b
parent	c677d4d396e75d362a626db6c56207ef4ee4befa

feat(python): Iterate over array buffers (#433)

The idea with this change is to support efficient buffer access for
chunked/streaming input (e.g., to build a NumPy array). The efficient
implementation is compact, but I am not sure it is easy to guess for
anybody not familiar with nanoarrow internals:

```python
with c_array_stream(obj, schema) as stream:
    for array in stream:
        view = array.view()
```

I'm not sure that `iter_chunk_data()` is the best name here, but one
would use it like:

```python
import nanoarrow as na

array = na.Array([1, 2, 3], na.int32())

for view in array.iter_chunk_data():
    print(view.offset, view.length, list(view.buffers))
#> 0 3 [nanoarrow.c_lib.CBufferView(bool[0 b] ), nanoarrow.c_lib.CBufferView(int32[12 b] 1 2 3)]
```

This would replace `iter_buffers()`, which is a little dangerous to use:
one might assume the whole buffer represents the array, when really the
offset is needed everywhere one might access a buffer. It also cleans up
some of the `ArrayViewIterator` terminology, since an earlier version of
this used the `ArrayViewIterator` instead of the simpler approach it now
uses.
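
To make the offset concern concrete, here is a minimal standard-library sketch. `ChunkView` and `chunk_values()` are hypothetical stand-ins invented for illustration (they are not nanoarrow types); the only assumption taken from the description above is that each chunk exposes an `offset`, a `length`, and a data buffer that may be larger than the array itself.

```python
from array import array
from typing import NamedTuple


class ChunkView(NamedTuple):
    """Hypothetical stand-in for one chunk: a data buffer plus the logical
    offset and length of the array within that buffer."""

    data: memoryview
    offset: int
    length: int


def chunk_values(view: ChunkView) -> list:
    # Only the elements in [offset, offset + length) belong to the array.
    return view.data[view.offset:view.offset + view.length].tolist()


# A buffer of five int32 values of which the array covers only the middle
# three, as can happen when an array is a zero-copy slice of a larger one.
storage = array("i", [10, 20, 30, 40, 50])
view = ChunkView(memoryview(storage), offset=1, length=3)

whole_buffer = view.data.tolist()  # what a naive reader of the buffer sees
sliced = chunk_values(view)  # what the array actually contains

print(whole_buffer)
print(sliced)
```

Reading the whole buffer yields five values, while the array only contains three; yielding the offset and length together with the buffers is what lets a caller slice correctly.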

The benchmark below is engineered to find the point where this iterator
becomes slower than `pa.ChunkedArray.to_numpy()`: for a million doubles
in this specific example, PyArrow becomes faster somewhere between 100
and 1000 chunks.

```python
import nanoarrow as na
from nanoarrow import c_lib
import pyarrow as pa
import numpy as np

n = int(1e6)
chunk_size = int(1e4)
num_chunks = n // chunk_size
n = chunk_size * num_chunks

chunks = [na.c_array(np.random.random(chunk_size)) for i in range(num_chunks)]
array = na.Array(c_lib.CArrayStream.from_array_list(chunks, na.c_schema(na.float64())))

def make():
    out = np.empty(len(array), dtype=np.float64)

    cursor = 0
    for view in array.iter_chunk_data():
        offset = view.offset
        length = view.length
        data = np.array(view.buffer(1), copy=False)[offset:(offset + length)]
        out[cursor:(cursor + length)] = np.array(data, copy=False)
        cursor += length

    return out


%timeit make()
#> 749 µs ± 37.8 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)

chunked = pa.chunked_array([pa.array(item) for item in chunks])
%timeit chunked.to_numpy()
#> 2.07 ms ± 17.7 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

# With 1000 chunks of size 1000, the numbers would be:
# iter_array_view()
#> 3.02 ms ± 238 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
# chunked.to_numpy()
#> 2.07 ms ± 16.4 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

np.testing.assert_equal(make(), chunked.to_numpy())
```
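
A sweep over chunk counts is one way to look for a crossover like the one described above. The sketch below uses only the standard library: `make_chunks()` and `copy_chunks()` are hypothetical helpers written for illustration, the chunks are plain `array.array` objects rather than nanoarrow chunks, and the timings it prints say nothing about nanoarrow or PyArrow. It only shows the shape of the harness: fixed total size, varying chunk count, per-chunk copy into a preallocated output.

```python
import timeit
from array import array


def make_chunks(n: int, num_chunks: int) -> list:
    """Split n doubles into num_chunks equally sized chunks (made-up data)."""
    chunk_size = n // num_chunks
    return [array("d", [float(i)] * chunk_size) for i in range(num_chunks)]


def copy_chunks(chunks: list) -> array:
    """Copy every chunk into one preallocated output, tracking a cursor."""
    total = sum(len(chunk) for chunk in chunks)
    out = array("d", [0.0]) * total  # zero-filled output of `total` doubles
    cursor = 0
    for chunk in chunks:
        out[cursor:cursor + len(chunk)] = chunk
        cursor += len(chunk)
    return out


n = 100_000
for num_chunks in (10, 100, 1000):
    chunks = make_chunks(n, num_chunks)
    # Best of three repeats, five calls each, reported per call.
    best = min(timeit.repeat(lambda: copy_chunks(chunks), number=5, repeat=3))
    print(f"{num_chunks:>5} chunks: {best / 5 * 1e6:.0f} us per call")
```

Because the per-chunk overhead is paid once per chunk, the total time for a fixed number of elements tends to grow with the chunk count, which is the effect the benchmark above is probing.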

---------

Co-authored-by: Dane Pitkin <48041712+danepitkin@users.noreply.github.com>

4 files changed

README.md

nanoarrow

The nanoarrow library is a set of helper functions to interpret and generate Arrow C Data Interface and Arrow C Stream Interface structures. The library is in active early development and users should update regularly from the main branch of this repository.

Whereas the current suite of Arrow implementations provide the basis for a comprehensive data analysis toolkit, this library is intended to support clients that wish to produce or interpret Arrow C Data and/or Arrow C Stream structures where linking to a higher level Arrow binding is difficult or impossible.

Using the C library

The nanoarrow C library is intended to be copied and vendored. This can be done using CMake or by using the bundled nanoarrow.h/nanoarrow.c distribution available in the dist/ directory in this repository. Examples of both can be found in the examples/ directory in this repository.

A simple producer example:

#include "nanoarrow.h"

int make_simple_array(struct ArrowArray* array_out, struct ArrowSchema* schema_out) {
  struct ArrowError error;
  array_out->release = NULL;
  schema_out->release = NULL;

  NANOARROW_RETURN_NOT_OK(ArrowArrayInitFromType(array_out, NANOARROW_TYPE_INT32));

  NANOARROW_RETURN_NOT_OK(ArrowArrayStartAppending(array_out));
  NANOARROW_RETURN_NOT_OK(ArrowArrayAppendInt(array_out, 1));
  NANOARROW_RETURN_NOT_OK(ArrowArrayAppendInt(array_out, 2));
  NANOARROW_RETURN_NOT_OK(ArrowArrayAppendInt(array_out, 3));
  NANOARROW_RETURN_NOT_OK(ArrowArrayFinishBuildingDefault(array_out, &error));

  NANOARROW_RETURN_NOT_OK(ArrowSchemaInitFromType(schema_out, NANOARROW_TYPE_INT32));

  return NANOARROW_OK;
}

A simple consumer example:

#include <stdio.h>

#include "nanoarrow.h"

int print_simple_array(struct ArrowArray* array, struct ArrowSchema* schema) {
  struct ArrowError error;
  struct ArrowArrayView array_view;
  NANOARROW_RETURN_NOT_OK(ArrowArrayViewInitFromSchema(&array_view, schema, &error));

  if (array_view.storage_type != NANOARROW_TYPE_INT32) {
    printf("Array has storage that is not int32\n");
  }

  int result = ArrowArrayViewSetArray(&array_view, array, &error);
  if (result != NANOARROW_OK) {
    ArrowArrayViewReset(&array_view);
    return result;
  }

  for (int64_t i = 0; i < array->length; i++) {
    printf("%d\n", (int)ArrowArrayViewGetIntUnsafe(&array_view, i));
  }

  ArrowArrayViewReset(&array_view);
  return NANOARROW_OK;
}

Building with Meson

CMake is the officially supported build system for nanoarrow. However, the Meson backend is an experimental feature you may also wish to try.

To run the test suite with Meson, you will want to first install the testing dependencies via the wrap database (n.b. no wrap database entry exists for Arrow - that must be installed separately).

mkdir subprojects
meson wrap install gtest
meson wrap install google-benchmark
meson wrap install nlohmann_json

The Arrow C++ library must also be discoverable via pkg-config in order to build the tests.

You can then set up your build directory:

meson setup builddir
cd builddir

And configure your project (this could also have been done inline with setup):

meson configure -DNANOARROW_BUILD_TESTS=true -DNANOARROW_BUILD_BENCHMARKS=true

Note that if your Arrow pkg-config profile is installed in a non-standard location on your system, you may pass --pkg-config-path <path to directory with arrow.pc> to either the setup or configure steps above.

With the above out of the way, the compile command should take care of the rest:

meson compile

Upon a successful build you can execute the test suite and benchmarks with the following commands:

meson test nanoarrow:  # default test run
meson test nanoarrow: --wrap valgrind  # run tests under valgrind
meson test nanoarrow: --benchmark --verbose # run benchmarks