commit	7af6dfff2a194a9b0b599a4fd7ed9d4f3ff5f650
author	Dewey Dunnington <dewey@dunnington.ca>	Thu Mar 21 16:00:29 2024 -0300
committer	GitHub <noreply@github.com>	Thu Mar 21 16:00:29 2024 -0300
tree	7502ccfe202ef24d7a8bf5f3bd06b040b73cb942
parent	dc50114756b7e9067b42181a6a86f928effc6e68

feat(python): Add user-facing `Array` class (#396)

This PR implements the `nanoarrow.Array`, which is basically a
`pyarrow.ChunkedArray`. It can represent a `Table`, `RecordBatch`,
`ChunkedArray`, or `Array`. It doesn't quite play nicely with pyarrow's
`ChunkedArray` yet (but will after the next release, since
`__arrow_c_stream__` was just added there).

The user-facing class is backed by a Cython class, the
`CMaterializedArrayStream`, which manages some of the C-level details
like resolving a chunk + offset when there is more than one chunk in the
array. An early version of this PR implemented the
`CMaterializedArrayStream` using C pointers (e.g., `ArrowArray*
arrays`), but I decided that was too complex and went back to
`List[CArray]`. I think this is also better for managing ownership
(e.g., `CArray` instances that are no longer needed can be released by
the garbage collector).
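
As a rough illustration of the chunk + offset resolution described above, here is a minimal pure-Python sketch. It is not the Cython implementation from this PR: `ChunkResolver` and its methods are hypothetical names, and plain Python lists stand in for the `CArray` chunks. It assumes a cumulative-length table searched with `bisect`, which is one common way to do this lookup.

```python
from bisect import bisect_right
from itertools import accumulate
from typing import List, Sequence, Tuple


class ChunkResolver:
    """Hypothetical helper: maps a logical index to (chunk index, offset in chunk)."""

    def __init__(self, chunks: Sequence[Sequence]) -> None:
        # Keeping the chunks in a plain list lets the garbage collector
        # reclaim any chunk object once nothing references it anymore.
        self._chunks: List[Sequence] = list(chunks)
        # Cumulative end position of each chunk,
        # e.g. chunk lengths [3, 0, 2] give ends [3, 3, 5].
        self._ends: List[int] = list(accumulate(len(c) for c in self._chunks))

    def __len__(self) -> int:
        return self._ends[-1] if self._ends else 0

    def resolve(self, i: int) -> Tuple[int, int]:
        n = len(self)
        if i < 0:
            i += n  # support negative indexing like a Python list
        if not 0 <= i < n:
            raise IndexError(f"index {i} out of range for length {n}")
        # Find the first chunk whose cumulative end is strictly greater
        # than i; bisect_right naturally skips over empty chunks.
        chunk_i = bisect_right(self._ends, i)
        chunk_start = self._ends[chunk_i - 1] if chunk_i > 0 else 0
        return chunk_i, i - chunk_start

    def __getitem__(self, i: int):
        chunk_i, offset = self.resolve(i)
        return self._chunks[chunk_i][offset]
```

With chunks `[[10, 11, 12], [], [13, 14]]`, logical index 3 resolves to chunk 2 at offset 0, skipping the empty middle chunk.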

The `Array` class as implemented here is device-aware, although until we
have non-CPU support it's difficult to test this. The methods I added
here are basically stubs just to demonstrate the intention.

This PR also implements the `Scalar`, whose main purpose is testing and
other non-performance-sensitive things (like lazier reprs for very large
items or interactive inspection). Scalars may also be useful for working
with arrays that contain elements with very long strings or large arrays
(e.g., geometry).
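
To make the "lazier repr" idea concrete, here is a small pure-Python sketch. It is an illustration only, not the `Scalar` added in this PR: `LazyScalar`, `as_py()`, and the 40-character limit are all assumptions. The object holds only a reference to its parent sequence plus an index, and it produces the full value only when asked.

```python
from typing import Any, Sequence


class LazyScalar:
    """Hypothetical stand-in for a scalar: a reference to one element of a parent."""

    # Assumed display width for illustration; not taken from nanoarrow
    MAX_REPR_WIDTH = 40

    def __init__(self, parent: Sequence, index: int) -> None:
        # Store a reference and a position rather than copying the element
        self._parent = parent
        self._index = index

    def as_py(self) -> Any:
        # Hand back the full value only when explicitly requested
        return self._parent[self._index]

    def __repr__(self) -> str:
        # Truncate long values so that inspecting a scalar holding a very
        # long string or nested array does not flood the console.
        text = repr(self.as_py())
        if len(text) > self.MAX_REPR_WIDTH:
            text = text[: self.MAX_REPR_WIDTH - 3] + "..."
        return f"LazyScalar({text})"
```

Here `repr(LazyScalar(["b"], 0))` gives `LazyScalar('b')`, while a 100-character string is cut to 40 characters in the repr but returned whole by `as_py()`.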

I also added some basic accessors like `buffer()`, `child()`, and some
ways one might want to iterate over an `Array` to make the utility of
this class more clear.

Basic usage:

```python
import nanoarrow as na

na.Array(range(100), na.int64())
```

```
nanoarrow.Array<int64>[100]
0
1
2
3
4
5
6
7
8
9
...and 90 more items
```

More involved example reading from an IPC stream:

```python
import nanoarrow as na
from nanoarrow.ipc import Stream

url = "https://github.com/apache/arrow-testing/raw/master/data/arrow-ipc-stream/integration/1.0.0-littleendian/generated_primitive.stream"

with Stream.from_url(url) as inp:
    array = na.Array(inp)

array.child(25)
```

```
nanoarrow.Array<string>[37]
'co矢2p矢m'
'wÂ€acrd'
'kjd1dlô'
'pib矢d5w'
'6nnpwôg'
'ndj£h£4'
'ôôf4aµg'
'kwÂh£fr'
'°g5dk€e'
'r€cbmdn'
...and 27 more items
```

---------

Co-authored-by: Joris Van den Bossche <jorisvandenbossche@gmail.com>

13 files changed


README.md

nanoarrow

The nanoarrow library is a set of helper functions to interpret and generate Arrow C Data Interface and Arrow C Stream Interface structures. The library is in active early development and users should update regularly from the main branch of this repository.

Whereas the current suite of Arrow implementations provides the basis for a comprehensive data analysis toolkit, this library is intended to support clients that wish to produce or interpret Arrow C Data and/or Arrow C Stream structures where linking to a higher-level Arrow binding is difficult or impossible.

Using the C library

The nanoarrow C library is intended to be copied and vendored. This can be done using CMake or by using the bundled nanoarrow.h/nanoarrow.c distribution available in the dist/ directory in this repository. Examples of both can be found in the examples/ directory in this repository.

A simple producer example:

```c
#include "nanoarrow.h"

// Build an int32 array containing [1, 2, 3] together with its matching schema.
int make_simple_array(struct ArrowArray* array_out, struct ArrowSchema* schema_out) {
  struct ArrowError error;
  // Mark both outputs as released until they are successfully initialized
  array_out->release = NULL;
  schema_out->release = NULL;

  NANOARROW_RETURN_NOT_OK(ArrowArrayInitFromType(array_out, NANOARROW_TYPE_INT32));

  // Append the values one at a time, then finalize the buffers
  NANOARROW_RETURN_NOT_OK(ArrowArrayStartAppending(array_out));
  NANOARROW_RETURN_NOT_OK(ArrowArrayAppendInt(array_out, 1));
  NANOARROW_RETURN_NOT_OK(ArrowArrayAppendInt(array_out, 2));
  NANOARROW_RETURN_NOT_OK(ArrowArrayAppendInt(array_out, 3));
  NANOARROW_RETURN_NOT_OK(ArrowArrayFinishBuildingDefault(array_out, &error));

  NANOARROW_RETURN_NOT_OK(ArrowSchemaInitFromType(schema_out, NANOARROW_TYPE_INT32));

  return NANOARROW_OK;
}
```

A simple consumer example:

```c
#include <stdio.h>

#include "nanoarrow.h"

// Print each element of an array on its own line.
int print_simple_array(struct ArrowArray* array, struct ArrowSchema* schema) {
  struct ArrowError error;
  struct ArrowArrayView array_view;
  NANOARROW_RETURN_NOT_OK(ArrowArrayViewInitFromSchema(&array_view, schema, &error));

  // Warn (but continue) if the storage type is not what this example expects
  if (array_view.storage_type != NANOARROW_TYPE_INT32) {
    printf("Array has storage that is not int32\n");
  }

  // Point the view at the array; release the view's resources on failure
  int result = ArrowArrayViewSetArray(&array_view, array, &error);
  if (result != NANOARROW_OK) {
    ArrowArrayViewReset(&array_view);
    return result;
  }

  for (int64_t i = 0; i < array->length; i++) {
    printf("%d\n", (int)ArrowArrayViewGetIntUnsafe(&array_view, i));
  }

  ArrowArrayViewReset(&array_view);
  return NANOARROW_OK;
}
```