commit	094815189e43e4bb37bafef8dab124ed1533066b
author	Dewey Dunnington <dewey@dunnington.ca>	Tue Apr 16 12:50:54 2024 -0300
committer	GitHub <noreply@github.com>	Tue Apr 16 12:50:54 2024 -0300
tree	d728539aff2aa3e7d4195e847a6597c6779d153c
parent	917e8e7dbe17efb5ba4a1e859181f6fb48aa2fa5

feat(python): Create string/binary arrays from iterables (#430)

This PR adds support for building string and binary arrays from iterables.

It also cleans up a few parts of #426 that resulted in the wheel builds
failing for (at least) PyPy 3.8 and 3.9. We can circle back to the
performance of building from iterables (and whether or not `pack_into()`
is essential) when all the wheels are building reliably.

```python
import nanoarrow as na

strings = ["pizza", "yogurt", "noodles", "peanut butter sandwiches"]

na.Array(strings, na.string())
#> nanoarrow.Array<string>[4]
#> 'pizza'
#> 'yogurt'
#> 'noodles'
#> 'peanut butter sandwiches'

na.Array((s.encode() for s in strings), na.binary())
#> nanoarrow.Array<binary>[4]
#> b'pizza'
#> b'yogurt'
#> b'noodles'
#> b'peanut butter sandwiches'
```
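For readers unfamiliar with what building a string array from an iterable involves, the following is a minimal pure-Python sketch of the Arrow variable-length layout (validity bitmap, int32 offsets, and data buffer). It is not the nanoarrow implementation: the helper name `build_string_buffers` is hypothetical, and it uses `struct.pack()` only to illustrate how offsets are serialized.

```python
import struct


def build_string_buffers(iterable):
    """Hypothetical helper: pack an iterable of str/None into Arrow-style buffers."""
    validity = bytearray()                     # one bit per element, least-significant bit first
    offsets = bytearray(struct.pack("<i", 0))  # little-endian int32 offsets, starting at 0
    data = bytearray()                         # concatenated UTF-8 bytes
    length = 0
    null_count = 0

    for item in iterable:
        if length % 8 == 0:
            validity.append(0)                 # start a new bitmap byte every 8 elements
        if item is None:
            null_count += 1                    # bit stays 0 for a null
        else:
            validity[-1] |= 1 << (length % 8)  # mark this element as valid
            data += item.encode("utf-8")
        # Every element, null or not, records the end offset of its data
        offsets += struct.pack("<i", len(data))
        length += 1

    return length, null_count, bytes(validity), bytes(offsets), bytes(data)


length, null_count, validity, offsets, data = build_string_buffers(
    s for s in ["pizza", None, "yogurt"]
)
print(length, null_count)
#> 3 1
print(struct.unpack("<4i", offsets))
#> (0, 5, 5, 11)
print(data)
#> b'pizzayogurt'
```

Because the length of an iterable is not known up front, the sketch grows each buffer as it goes rather than preallocating, which is one reason building from iterables is harder to make fast than building from a sized sequence.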

The "build from iterable" code is now sufficiently complicated that it
should be separated out. I did an initial attempt at that for this PR;
however, it scrambles things up a bit and is complicated by the
interdependence between the functions that sanitize arguments (e.g.,
`c_schema()`, `c_array()`) and the functions that build from iterable.

Currently faster than pyarrow for strings and for bytes without nulls, and slightly slower for bytes with nulls.

```python
from itertools import cycle, islice
import nanoarrow as na
import pyarrow as pa

strings = ["pizza", "yogurt", "noodles", "peanut butter sandwiches"]
binary = [s.encode() for s in strings]

def many_strings():
    return islice(cycle(strings), int(1e6))

def many_strings_with_nulls():
    return islice(cycle(strings + [None]), int(1e6))

def many_bytes():
    return islice(cycle(binary), int(1e6))

def many_bytes_with_nulls():
    return islice(cycle(binary + [None]), int(1e6))

%timeit pa.array(many_strings(), pa.string())
#> 23.4 ms ± 488 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit na.c_array(many_strings(), na.string())
#> 14.3 ms ± 112 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

%timeit pa.array(many_strings_with_nulls(), pa.string())
#> 21.4 ms ± 340 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit na.c_array(many_strings_with_nulls(), na.string())
#> 17.1 ms ± 340 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

%timeit pa.array(many_bytes(), pa.binary())
#> 19.7 ms ± 283 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit na.c_array(many_bytes(), na.binary())
#> 16.3 ms ± 136 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

%timeit pa.array(many_bytes_with_nulls(), pa.binary())
#> 17.6 ms ± 37.2 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%timeit na.c_array(many_bytes_with_nulls(), na.binary())
#> 19 ms ± 378 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
```
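To make the comparison easier to read, the mean timings reported above can be turned into speedup ratios. This is only arithmetic on the numbers already shown (pyarrow time divided by nanoarrow time, so values above 1 mean nanoarrow was faster); the variable names are illustrative.

```python
# Mean timings in milliseconds, copied from the %timeit output above:
# (pyarrow, nanoarrow)
timings_ms = {
    "strings": (23.4, 14.3),
    "strings with nulls": (21.4, 17.1),
    "bytes": (19.7, 16.3),
    "bytes with nulls": (17.6, 19.0),
}

# Ratio > 1 means nanoarrow took less time than pyarrow for that case
speedups = {case: pa_ms / na_ms for case, (pa_ms, na_ms) in timings_ms.items()}

for case, ratio in speedups.items():
    print(f"{case}: {ratio:.2f}x")
#> strings: 1.64x
#> strings with nulls: 1.25x
#> bytes: 1.21x
#> bytes with nulls: 0.93x
```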

6 files changed

README.md

nanoarrow

The nanoarrow library is a set of helper functions to interpret and generate Arrow C Data Interface and Arrow C Stream Interface structures. The library is in active early development and users should update regularly from the main branch of this repository.

Whereas the current suite of Arrow implementations provides the basis for a comprehensive data analysis toolkit, this library is intended to support clients that wish to produce or interpret Arrow C Data and/or Arrow C Stream structures where linking to a higher-level Arrow binding is difficult or impossible.

Using the C library

The nanoarrow C library is intended to be copied and vendored. This can be done using CMake or by using the bundled nanoarrow.h/nanoarrow.c distribution available in the dist/ directory in this repository. Examples of both can be found in the examples/ directory in this repository.

A simple producer example:

#include "nanoarrow.h"

int make_simple_array(struct ArrowArray* array_out, struct ArrowSchema* schema_out) {
  struct ArrowError error;
  array_out->release = NULL;
  schema_out->release = NULL;

  NANOARROW_RETURN_NOT_OK(ArrowArrayInitFromType(array_out, NANOARROW_TYPE_INT32));

  NANOARROW_RETURN_NOT_OK(ArrowArrayStartAppending(array_out));
  NANOARROW_RETURN_NOT_OK(ArrowArrayAppendInt(array_out, 1));
  NANOARROW_RETURN_NOT_OK(ArrowArrayAppendInt(array_out, 2));
  NANOARROW_RETURN_NOT_OK(ArrowArrayAppendInt(array_out, 3));
  NANOARROW_RETURN_NOT_OK(ArrowArrayFinishBuildingDefault(array_out, &error));

  NANOARROW_RETURN_NOT_OK(ArrowSchemaInitFromType(schema_out, NANOARROW_TYPE_INT32));

  return NANOARROW_OK;
}

A simple consumer example:

#include <errno.h>
#include <stdio.h>

#include "nanoarrow.h"

int print_simple_array(struct ArrowArray* array, struct ArrowSchema* schema) {
  struct ArrowError error;
  struct ArrowArrayView array_view;
  NANOARROW_RETURN_NOT_OK(ArrowArrayViewInitFromSchema(&array_view, schema, &error));

  // Stop here on a type mismatch instead of reading the values as int32
  if (array_view.storage_type != NANOARROW_TYPE_INT32) {
    printf("Array has storage that is not int32\n");
    ArrowArrayViewReset(&array_view);
    return EINVAL;
  }

  int result = ArrowArrayViewSetArray(&array_view, array, &error);
  if (result != NANOARROW_OK) {
    ArrowArrayViewReset(&array_view);
    return result;
  }

  for (int64_t i = 0; i < array->length; i++) {
    printf("%d\n", (int)ArrowArrayViewGetIntUnsafe(&array_view, i));
  }

  ArrowArrayViewReset(&array_view);
  return NANOARROW_OK;
}

Building with Meson

CMake is the officially supported build system for nanoarrow. However, the Meson backend is an experimental feature you may also wish to try.

To run the test suite with Meson, you will want to first install the testing dependencies via the wrap database (n.b. no wrap database entry exists for Arrow - that must be installed separately).

mkdir subprojects
meson wrap install gtest
meson wrap install google-benchmark
meson wrap install nlohmann_json

The Arrow C++ library must also be discoverable via pkg-config to build the tests.

You can then set up your build directory:

meson setup builddir
cd builddir

Then configure your project (this could also have been done inline with setup):

meson configure -DNANOARROW_BUILD_TESTS=true -DNANOARROW_BUILD_BENCHMARKS=true

Note that if your Arrow pkg-config profile is installed in a non-standard location on your system, you may pass --pkg-config-path <path to directory with arrow.pc> to either the setup or configure step above.

With the above out of the way, the compile command should take care of the rest:

meson compile

Upon a successful build you can execute the test suite and benchmarks with the following commands:

meson test nanoarrow:  # default test run
meson test nanoarrow: --wrap valgrind  # run tests under valgrind
meson test nanoarrow: --benchmark --verbose # run benchmarks