commit	8e8e38d3890a16348c82d1c83a27d8af56085b25
author	Dewey Dunnington <dewey@dunnington.ca>	Mon Apr 15 11:34:13 2024 -0300
committer	GitHub <noreply@github.com>	Mon Apr 15 11:34:13 2024 -0300
tree	baa24548a6be09b371cd580cebd2cd78d329e885
parent	59a281cca2ed63925931f6fd0dc492a7dbbb4530

chore(python): Restructure buffer packing to support nulls and improve performance (#426)

First, this PR replaces the uninformative error that was previously raised
whenever building an Array failed (closes #423). The error is now:

```python
import nanoarrow as na
na.Array([1, 2, 3])
#> ValueError
#> ...
#> An error occurred whilst converting object of type list to nanoarrow.c_array_stream or nanoarrow.c_array: 
#> schema is required for CArray import from iterable
```
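
One way to produce a message of this shape is to catch the underlying failure and re-raise it with the object's type and the attempted targets attached. The following is only a sketch of that pattern: `convert_with_context`, `converters`, and `needs_schema` are hypothetical names made up for this illustration and are not part of nanoarrow.

```python
def convert_with_context(obj, converters):
    """Try each (name, func) pair in turn; raise one informative ValueError.

    Hypothetical helper for illustration only, not nanoarrow's API.
    """
    last_error = None
    for _, func in converters:
        try:
            return func(obj)
        except Exception as e:
            # Remember the most recent failure so it can be reported
            last_error = e

    targets = " or ".join(name for name, _ in converters)
    raise ValueError(
        f"An error occurred whilst converting object of type "
        f"{type(obj).__name__} to {targets}: \n{last_error}"
    ) from last_error


def needs_schema(obj):
    # Stand-in for a converter that cannot infer a type from a bare list
    raise ValueError("schema is required for CArray import from iterable")
```

Chaining with `from last_error` keeps the original traceback available while the top-level message names both the input type and the underlying cause.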

Second, this PR adds support for `None` in iterables. This makes it much
more convenient to create arrays with nulls (closes #424).

```python
import nanoarrow as na
na.Array([1, 2, None, 4], na.int32())
#> nanoarrow.Array<int32>[4]
#> 1
#> 2
#> None
#> 4 
```
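
For context, an Arrow array with nulls carries a validity bitmap alongside its data buffer, with one bit per element (1 for valid, least-significant bit first). The sketch below shows how an iterable containing `None` could be split into those two buffers using only the standard library. `pack_nullable` is a hypothetical helper for illustration and is not how nanoarrow implements this internally.

```python
import struct


def pack_nullable(values, fmt="i"):
    """Return (validity_bitmap, data, null_count) for an iterable.

    Hypothetical helper for illustration; not nanoarrow's implementation.
    """
    values = list(values)
    item = struct.Struct("=" + fmt)
    data = bytearray(len(values) * item.size)     # null slots stay zeroed
    validity = bytearray((len(values) + 7) // 8)  # one bit per element
    null_count = 0

    for i, value in enumerate(values):
        if value is None:
            null_count += 1
        else:
            # Arrow validity bitmaps are packed least-significant bit first
            validity[i // 8] |= 1 << (i % 8)
            item.pack_into(data, i * item.size, value)

    return bytes(validity), bytes(data), null_count
```

For `[1, 2, None, 4]` the bitmap has bits 0, 1, and 3 set, and the data slot for the null element is left as zero.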

Finally, this PR reworks how an iterable is packed into a buffer, which
was previously very slow. The optimizations added were:

- The `CBufferBuilder` now implements the buffer protocol (so that we
can use `pack_into`)
- If the iterable defines `__len__`, its length is used to preallocate
the buffer where possible
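
The two optimizations above can be sketched with the standard library alone. Here a `bytearray` stands in for the `CBufferBuilder` (both expose the buffer protocol, which is what `struct.Struct.pack_into` needs). `pack_iterable` is a hypothetical name for this illustration and the real code in the PR may differ.

```python
import struct


def pack_iterable(obj, fmt="d"):
    """Pack an iterable of numbers into a contiguous bytearray.

    Hypothetical helper for illustration; not nanoarrow's implementation.
    """
    item = struct.Struct("=" + fmt)

    if hasattr(obj, "__len__"):
        # Known length: allocate once and write each item in place
        buf = bytearray(len(obj) * item.size)
        offset = 0
        for value in obj:
            item.pack_into(buf, offset, value)
            offset += item.size
        return buf

    # Unknown length (e.g. a generator): fall back to growing the buffer
    buf = bytearray()
    for value in obj:
        buf.extend(item.pack(value))
    return buf
```

Writing in place avoids creating a temporary `bytes` object per element, which is where the fallback path spends much of its time.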

Those optimizations resulted in a ~2x improvement over the previous
code; however, the types that can use the `array` constructor have the
biggest wins (5-6x improvement).
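
Assuming the `array` constructor here refers to Python's standard library `array.array`, the fast path amounts to letting it consume the iterable in C rather than packing element by element. This is a minimal sketch with a hypothetical `pack_with_array` helper, not the PR's actual code.

```python
import array


def pack_with_array(values, typecode="d"):
    """Build a contiguous buffer from an iterable in a single call.

    Hypothetical helper for illustration; not nanoarrow's implementation.
    array.array rejects None, so this only suits input without nulls.
    """
    return array.array(typecode, values)


packed = pack_with_array([1.0, 2.0, 3.0])
# The result supports the buffer protocol, so it can be viewed without copying
view = memoryview(packed)
```

Because `array.array` raises `TypeError` on `None`, a shortcut like this would only apply when the input is known to contain no nulls, which is consistent with the `nullable=False` timings in the example that follows.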

An example with the biggest gain:

```python
import numpy as np
import nanoarrow as na
import pyarrow as pa

floats = np.random.random(int(1e6))
floats_lst = list(floats)

%timeit pa.array(floats, pa.float64())
#> 1.79 µs ± 9.27 ns per loop (mean ± std. dev. of 7 runs, 1,000,000 loops each)
%timeit pa.array(floats_lst, pa.float64())
#> 13.8 ms ± 35.5 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%timeit pa.array(iter(floats_lst), pa.float64())
#> 17.9 ms ± 37.8 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

%timeit na.c_array(floats, na.float64())
#> 5.51 µs ± 25.1 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)
%timeit na.c_array(floats_lst, na.float64(nullable=False))
#> 16.5 ms ± 41.3 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%timeit na.c_array(iter(floats_lst), na.float64(nullable=False))
#> 29.1 ms ± 254 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit na.c_array(floats_lst, na.float64())
#> 43.6 ms ± 484 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit na.c_array(iter(floats_lst), na.float64())
#> 43 ms ± 227 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
```

Before this PR:

```python
%timeit na.c_array(floats, na.float64())
#> 5.66 µs ± 44.4 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)
%timeit na.c_array(floats_lst, na.float64())
#> 104 ms ± 187 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit na.c_array(iter(floats_lst), na.float64())
#> 107 ms ± 202 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
```

There will probably be one more PR on top of this one to support
building variable-length string/binary arrays (and possibly to move
some of the building code out of `c_lib.py`, which is getting a little
crowded).

5 files changed

README.md

nanoarrow

The nanoarrow library is a set of helper functions to interpret and generate Arrow C Data Interface and Arrow C Stream Interface structures. The library is in active early development and users should update regularly from the main branch of this repository.

Whereas the current suite of Arrow implementations provides the basis for a comprehensive data analysis toolkit, this library is intended to support clients that wish to produce or interpret Arrow C Data and/or Arrow C Stream structures where linking to a higher-level Arrow binding is difficult or impossible.

Using the C library

The nanoarrow C library is intended to be copied and vendored. This can be done using CMake or by using the bundled nanoarrow.h/nanoarrow.c distribution available in the dist/ directory in this repository. Examples of both can be found in the examples/ directory in this repository.

A simple producer example:

#include "nanoarrow.h"

int make_simple_array(struct ArrowArray* array_out, struct ArrowSchema* schema_out) {
  struct ArrowError error;
  array_out->release = NULL;
  schema_out->release = NULL;

  NANOARROW_RETURN_NOT_OK(ArrowArrayInitFromType(array_out, NANOARROW_TYPE_INT32));

  NANOARROW_RETURN_NOT_OK(ArrowArrayStartAppending(array_out));
  NANOARROW_RETURN_NOT_OK(ArrowArrayAppendInt(array_out, 1));
  NANOARROW_RETURN_NOT_OK(ArrowArrayAppendInt(array_out, 2));
  NANOARROW_RETURN_NOT_OK(ArrowArrayAppendInt(array_out, 3));
  NANOARROW_RETURN_NOT_OK(ArrowArrayFinishBuildingDefault(array_out, &error));

  NANOARROW_RETURN_NOT_OK(ArrowSchemaInitFromType(schema_out, NANOARROW_TYPE_INT32));

  return NANOARROW_OK;
}

A simple consumer example:

#include <errno.h>
#include <stdio.h>

#include "nanoarrow.h"

int print_simple_array(struct ArrowArray* array, struct ArrowSchema* schema) {
  struct ArrowError error;
  struct ArrowArrayView array_view;
  NANOARROW_RETURN_NOT_OK(ArrowArrayViewInitFromSchema(&array_view, schema, &error));

  // Stop here rather than printing values from non-int32 storage
  if (array_view.storage_type != NANOARROW_TYPE_INT32) {
    printf("Array has storage that is not int32\n");
    ArrowArrayViewReset(&array_view);
    return EINVAL;
  }

  int result = ArrowArrayViewSetArray(&array_view, array, &error);
  if (result != NANOARROW_OK) {
    ArrowArrayViewReset(&array_view);
    return result;
  }

  for (int64_t i = 0; i < array->length; i++) {
    printf("%d\n", (int)ArrowArrayViewGetIntUnsafe(&array_view, i));
  }

  ArrowArrayViewReset(&array_view);
  return NANOARROW_OK;
}

Building with Meson

CMake is the officially supported build system for nanoarrow. However, the Meson backend is an experimental feature you may also wish to try.

To run the test suite with Meson, you will want to first install the testing dependencies via the wrap database (n.b. no wrap database entry exists for Arrow - that must be installed separately).

mkdir subprojects
meson wrap install gtest
meson wrap install google-benchmark
meson wrap install nlohmann_json

The Arrow C++ library must also be discoverable via pkg-config to build the tests.

You can then set up your build directory:

meson setup builddir
cd builddir

Then configure your project (this could also have been done inline with setup):

meson configure -DNANOARROW_BUILD_TESTS=true -DNANOARROW_BUILD_BENCHMARKS=true

Note that if your Arrow pkg-config profile is installed in a non-standard location on your system, you may pass --pkg-config-path <path to directory with arrow.pc> to either the setup or configure step above.

With the above out of the way, the compile command should take care of the rest:

meson compile

Upon a successful build you can execute the test suite and benchmarks with the following commands:

meson test nanoarrow:  # default test run
meson test nanoarrow: --wrap valgrind  # run tests under valgrind
meson test nanoarrow: --benchmark --verbose # run benchmarks