commit	72a2e67bb3476a926026f128b622e89881204ae9
author	Dewey Dunnington <dewey@dunnington.ca>	Thu Jan 11 19:17:24 2024 +0000
committer	GitHub <noreply@github.com>	Thu Jan 11 15:17:24 2024 -0400
tree	f4304f01896064e57103e4d23b362c72e2e29557
parent	b636a8f88d6a529ff31da6847ac44b584847e5f6

refactor(python): Document, prefix, and add reprs for C-wrapping classes (#340)

This PR was inspired by #319 but only addresses the first half: it
prefixes C-wrapping classes so that the name `nanoarrow.array()` can be
used for a future class/constructor that more closely resembles a
`pyarrow.Array` or `numpy.ndarray`.

This PR does a few things:

- Uses capsules to manage allocation and cleanup of C resources instead
of "holder" objects. This eliminated some code and, in theory, makes it
possible to move some pieces out of Cython into C.
- Renames any "nanoarrow C library binding" classes to start with `C`
(e.g., `Schema` to `CSchema`). I made them slightly more literal as
well. Basically, these classes are about accessing the fields of the
structure without segfaulting. In a potential future world where we
don't use Cython, this is something like what we'd get with
auto-generated wrapper classes or thin C++ wrappers with generated
binding code.
- Opens the door for the user-facing versions of these: `Array`,
`Schema`, and an `ArrayStream`. The scope and design of those require
more iteration than this PR allows and would benefit from some other
infrastructure being in place first (e.g., conversion to/from Python).

To make it a little more clear what the existing structures actually are
and what they can do, I added `repr()`s for them and updated the README.
Briefly:

```python
import nanoarrow as na
import pyarrow as pa

na.cschema(pa.int32())
#> <nanoarrow.clib.CSchema int32>
#> - format: 'i'
#> - name: ''
#> - flags: 2
#> - metadata: NULL
#> - dictionary: NULL
#> - children[0]:

na.cschema_view(pa.timestamp('s', "America/Halifax"))
#> <nanoarrow.clib.CSchemaView>
#> - type: 'timestamp'
#> - storage_type: 'int64'
#> - time_unit: 's'
#> - timezone: 'America/Halifax'

na.carray(pa.array([1, 2, 3]))
#> <nanoarrow.clib.CArray int64>
#> - length: 3
#> - offset: 0
#> - null_count: 0
#> - buffers: (0, 3354772373824)
#> - dictionary: NULL
#> - children[0]:

na.carray_view(pa.array([1, 2, 3]))
#> <nanoarrow.clib.CArrayView>
#> - storage_type: 'int64'
#> - length: 3
#> - offset: 0
#> - null_count: 0
#> - buffers[2]:
#>   - <bool validity[0 b] >
#>   - <int64 data[24 b] 1 2 3>
#> - dictionary: NULL
#> - children[0]:

pa_array_child = pa.array([1, 2, 3], pa.int32())
pa_array = pa.record_batch([pa_array_child], names=["some_column"])
reader = pa.RecordBatchReader.from_batches(pa_array.schema, [pa_array])
na.carray_stream(reader)
#> <nanoarrow.clib.CArrayStream>
#> - get_schema(): struct<some_column: int32>
```
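
As a reading aid for the `CSchema` repr above: `flags` is a bit field
whose values are defined by the Arrow C Data Interface specification,
so `flags: 2` means the field is nullable. The helper below is a small
sketch for decoding it; `decode_schema_flags` is a hypothetical name
and not part of nanoarrow.

```python
# Flag values defined by the Arrow C Data Interface specification.
ARROW_FLAGS = {
    1: "ARROW_FLAG_DICTIONARY_ORDERED",
    2: "ARROW_FLAG_NULLABLE",
    4: "ARROW_FLAG_MAP_KEYS_SORTED",
}


def decode_schema_flags(flags):
    """Return the names of the flag bits set in an ArrowSchema.flags value."""
    return [name for bit, name in ARROW_FLAGS.items() if flags & bit]


print(decode_schema_flags(2))
#> ['ARROW_FLAG_NULLABLE']
```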

This involved fixing the existing `BufferView`, since its elements had
to be accessed to print the contents in a repr-friendly way. I think
the `BufferView` will see some changes, but it does seem relatively
performant:

```python
import pyarrow as pa
import nanoarrow as na
import numpy as np

n = int(1e6)
pa_array = pa.array(np.random.random(n))
na_array_view = na.carray_view(pa_array)

%timeit pa_array.to_pylist()
#> 169 ms ± 1.35 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit list(na_array_view.buffer(1))
#> 33.8 ms ± 340 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
```

---------

Co-authored-by: Dane Pitkin <48041712+danepitkin@users.noreply.github.com>

13 files changed

README.md

nanoarrow

The nanoarrow library is a set of helper functions to interpret and generate Arrow C Data Interface and Arrow C Stream Interface structures. The library is in active early development and users should update regularly from the main branch of this repository.

Whereas the current suite of Arrow implementations provide the basis for a comprehensive data analysis toolkit, this library is intended to support clients that wish to produce or interpret Arrow C Data and/or Arrow C Stream structures where linking to a higher level Arrow binding is difficult or impossible.

Using the C library

The nanoarrow C library is intended to be copied and vendored. This can be done using CMake or by using the bundled nanoarrow.h/nanoarrow.c distribution available in the dist/ directory in this repository. Examples of both can be found in the examples/ directory in this repository.

A simple producer example:

#include "nanoarrow.h"

int make_simple_array(struct ArrowArray* array_out, struct ArrowSchema* schema_out) {
  struct ArrowError error;
  array_out->release = NULL;
  schema_out->release = NULL;

  NANOARROW_RETURN_NOT_OK(ArrowArrayInitFromType(array_out, NANOARROW_TYPE_INT32));

  NANOARROW_RETURN_NOT_OK(ArrowArrayStartAppending(array_out));
  NANOARROW_RETURN_NOT_OK(ArrowArrayAppendInt(array_out, 1));
  NANOARROW_RETURN_NOT_OK(ArrowArrayAppendInt(array_out, 2));
  NANOARROW_RETURN_NOT_OK(ArrowArrayAppendInt(array_out, 3));
  NANOARROW_RETURN_NOT_OK(ArrowArrayFinishBuildingDefault(array_out, &error));

  NANOARROW_RETURN_NOT_OK(ArrowSchemaInitFromType(schema_out, NANOARROW_TYPE_INT32));

  return NANOARROW_OK;
}

A simple consumer example:

#include <errno.h>
#include <stdio.h>

#include "nanoarrow.h"

int print_simple_array(struct ArrowArray* array, struct ArrowSchema* schema) {
  struct ArrowError error;
  struct ArrowArrayView array_view;
  NANOARROW_RETURN_NOT_OK(ArrowArrayViewInitFromSchema(&array_view, schema, &error));

  // Stop here rather than reading non-int32 storage as integers below
  if (array_view.storage_type != NANOARROW_TYPE_INT32) {
    printf("Array has storage that is not int32\n");
    ArrowArrayViewReset(&array_view);
    return EINVAL;
  }

  int result = ArrowArrayViewSetArray(&array_view, array, &error);
  if (result != NANOARROW_OK) {
    ArrowArrayViewReset(&array_view);
    return result;
  }

  for (int64_t i = 0; i < array->length; i++) {
    printf("%d\n", (int)ArrowArrayViewGetIntUnsafe(&array_view, i));
  }

  ArrowArrayViewReset(&array_view);
  return NANOARROW_OK;
}