commit	c66ddc35a9ccf0374aadc2d3a8821431ed0c9ca6
author	Dewey Dunnington <dewey@dunnington.ca>	Wed Feb 21 13:14:40 2024 -0400
committer	GitHub <noreply@github.com>	Wed Feb 21 13:14:40 2024 -0400
tree	2fa0602c472dc0fd83995ab8b3f6cc288e3a0df0
parent	97a34e49e732f2033baf34a711e285fda2ceb31a

feat(python): Add bindings for IPC reader (#388)

This PR adds bindings to nanoarrow's (high-ish level) IPC reader. There
are some lower-level concepts that might be nice to expose at some point
but the `ArrowArrayStream` implementation lets users realize most of the
benefit.

I am envisioning a higher-level `ArrayStream` class on top of
`CArrayStream`, so I am not sure that the current interface (create a
`Stream`, then pass it to `na.c_array_stream()`) is the one users will
end up using. Probably something more like
`na.ArrayStream.from_stream([path or url or file])` would be
appropriate. I held off on writing the Examples section until the first
round of review in case there's a better way to go about this!

nanoarrow's reader is slower than pyarrow's reader when pyarrow opens the
file through its internal filesystem layer (which might do memory mapping
and avoids more copies; 1.19 ms in the benchmark below), but it is about
as fast as pyarrow's reader when pyarrow reads from a Python file-like
object (11.4 ms vs. 11.5 ms). See also #390, which implements basically
the same interface in R.

```python
import numpy as np
import pyarrow as pa
from pyarrow import ipc
import nanoarrow as na
from nanoarrow.ipc import Stream

n = int(1e6)
n_cols = 10
tab = pa.table(
    [np.random.random(n) for _ in range(n_cols)],
    names=[f"col{i}" for i in range(n_cols)]
)

with ipc.new_stream("file.arrows", tab.schema) as stream:
    stream.write_table(tab)

def read_pyarrow():
    with ipc.open_stream("file.arrows") as stream:
        return list(stream)

%timeit len(read_pyarrow())
#> 1.19 ms ± 48.9 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)

def read_pyarrow_pyobj():
    with open("file.arrows", "rb") as f, ipc.open_stream(f) as stream:
        return list(stream)

%timeit len(read_pyarrow_pyobj())
#> 11.5 ms ± 173 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

def read_nanoarrow():
    with Stream.from_path("file.arrows") as input:
        stream = na.c_array_stream(input)
        return list(stream)

%timeit len(read_nanoarrow())
#> 11.4 ms ± 56.6 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
```
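
The `%timeit` lines above are IPython magics, so the benchmark only runs as
written inside IPython or Jupyter. As a rough sketch for plain Python, a
comparable measurement can be taken with the standard library's `timeit`
module. The `bench` helper and the stand-in workload below are hypothetical
names introduced for illustration; they are not part of nanoarrow or pyarrow.

```python
import statistics
import timeit


def bench(func, repeat=7, number=100):
    """Time `func` roughly the way `%timeit` reports it.

    Returns (mean, standard deviation) in seconds per call, computed over
    `repeat` runs of `number` calls each.
    """
    totals = timeit.repeat(func, repeat=repeat, number=number)
    per_call = [total / number for total in totals]
    return statistics.mean(per_call), statistics.stdev(per_call)


# Stand-in workload (assumption): swap in read_pyarrow, read_pyarrow_pyobj,
# or read_nanoarrow from the benchmark above to time the readers themselves.
mean, stdev = bench(lambda: sum(range(1000)))
print(f"{mean * 1e3:.3f} ms ± {stdev * 1e6:.1f} µs per call")
```

The defaults (7 runs of 100 calls) mirror what `%timeit` chose for the
roughly 11 ms cases above; faster functions may need a larger `number`.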

---------

Co-authored-by: Joris Van den Bossche <jorisvandenbossche@gmail.com>
Co-authored-by: Dane Pitkin <48041712+danepitkin@users.noreply.github.com>

.github/workflows/python-wheels.yaml
python/.gitignore
python/MANIFEST.in
python/bootstrap.py
python/setup.py
python/src/nanoarrow/_ipc_lib.pyx (added)
python/src/nanoarrow/_lib.pyx
python/src/nanoarrow/_repr_utils.py (renamed from python/src/nanoarrow/_lib_utils.py)
python/src/nanoarrow/c_lib.py
python/src/nanoarrow/ipc.py (added)
python/tests/test_c_buffer.py
python/tests/test_ipc.py (added)
python/tests/test_nanoarrow.py

13 files changed

README.md

nanoarrow

The nanoarrow library is a set of helper functions to interpret and generate Arrow C Data Interface and Arrow C Stream Interface structures. The library is in active early development and users should update regularly from the main branch of this repository.

Whereas the current suite of Arrow implementations provides the basis for a comprehensive data analysis toolkit, this library is intended to support clients that wish to produce or interpret Arrow C Data and/or Arrow C Stream structures where linking to a higher-level Arrow binding is difficult or impossible.
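
To make the Arrow C Data Interface structures mentioned above more concrete, here is a minimal sketch that mirrors the `struct ArrowArray` layout from the Arrow C Data Interface specification using Python's standard `ctypes` module, fills it by hand for an int32 array, and reads the values back. It does not use nanoarrow at all. The helper `int32_values` is a hypothetical name introduced here, and the `release` callback is left NULL because Python owns the memory in this sketch, so the result is not a complete, exportable Arrow array.

```python
import ctypes


class ArrowArray(ctypes.Structure):
    """Mirror of `struct ArrowArray` from the Arrow C Data Interface."""


# The release callback takes a pointer to the struct itself, so the fields
# are assigned after the class object exists.
ArrowArrayRelease = ctypes.CFUNCTYPE(None, ctypes.POINTER(ArrowArray))
ArrowArray._fields_ = [
    ("length", ctypes.c_int64),
    ("null_count", ctypes.c_int64),
    ("offset", ctypes.c_int64),
    ("n_buffers", ctypes.c_int64),
    ("n_children", ctypes.c_int64),
    ("buffers", ctypes.POINTER(ctypes.c_void_p)),
    ("children", ctypes.POINTER(ctypes.POINTER(ArrowArray))),
    ("dictionary", ctypes.POINTER(ArrowArray)),
    ("release", ArrowArrayRelease),
    ("private_data", ctypes.c_void_p),
]

# A primitive (int32) array has two buffers: a validity bitmap, which may be
# NULL when there are no nulls, and the data buffer.
values = (ctypes.c_int32 * 3)(1, 2, 3)
buffers = (ctypes.c_void_p * 2)(None, ctypes.addressof(values))

array = ArrowArray(
    length=3,
    null_count=0,
    offset=0,
    n_buffers=2,
    n_children=0,
    buffers=ctypes.cast(buffers, ctypes.POINTER(ctypes.c_void_p)),
)
# Assumption: `release` stays NULL because `values` and `buffers` are kept
# alive by Python here. A real producer must set a release callback.


def int32_values(arr):
    """Read the int32 data buffer (buffers[1]), honoring offset and length."""
    data = ctypes.cast(arr.buffers[1], ctypes.POINTER(ctypes.c_int32))
    return [data[arr.offset + i] for i in range(arr.length)]


print(int32_values(array))  # [1, 2, 3]
```

Filling in and validating these fields by hand for every Arrow type is error-prone, which is the kind of work the helper functions in this library are meant to take over.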

Using the C library

The nanoarrow C library is intended to be copied and vendored. This can be done using CMake or by using the bundled nanoarrow.h/nanoarrow.c distribution available in the dist/ directory in this repository. Examples of both can be found in the examples/ directory in this repository.

A simple producer example:

#include "nanoarrow.h"

int make_simple_array(struct ArrowArray* array_out, struct ArrowSchema* schema_out) {
  struct ArrowError error;
  // Mark both outputs as released so callers can tell whether they were populated
  array_out->release = NULL;
  schema_out->release = NULL;

  // Initialize an empty int32 array that can be built by appending values
  NANOARROW_RETURN_NOT_OK(ArrowArrayInitFromType(array_out, NANOARROW_TYPE_INT32));

  // Append the values 1, 2, 3 and then finalize the array
  NANOARROW_RETURN_NOT_OK(ArrowArrayStartAppending(array_out));
  NANOARROW_RETURN_NOT_OK(ArrowArrayAppendInt(array_out, 1));
  NANOARROW_RETURN_NOT_OK(ArrowArrayAppendInt(array_out, 2));
  NANOARROW_RETURN_NOT_OK(ArrowArrayAppendInt(array_out, 3));
  NANOARROW_RETURN_NOT_OK(ArrowArrayFinishBuildingDefault(array_out, &error));

  // Describe the array's type with a matching int32 schema
  NANOARROW_RETURN_NOT_OK(ArrowSchemaInitFromType(schema_out, NANOARROW_TYPE_INT32));

  return NANOARROW_OK;
}

A simple consumer example:

#include <errno.h>
#include <stdio.h>

#include "nanoarrow.h"

int print_simple_array(struct ArrowArray* array, struct ArrowSchema* schema) {
  struct ArrowError error;
  struct ArrowArrayView array_view;
  NANOARROW_RETURN_NOT_OK(ArrowArrayViewInitFromSchema(&array_view, schema, &error));

  // Stop here (and release the view) if the storage type is not int32
  if (array_view.storage_type != NANOARROW_TYPE_INT32) {
    printf("Array has storage that is not int32\n");
    ArrowArrayViewReset(&array_view);
    return EINVAL;
  }

  int result = ArrowArrayViewSetArray(&array_view, array, &error);
  if (result != NANOARROW_OK) {
    ArrowArrayViewReset(&array_view);
    return result;
  }

  for (int64_t i = 0; i < array->length; i++) {
    printf("%d\n", (int)ArrowArrayViewGetIntUnsafe(&array_view, i));
  }

  ArrowArrayViewReset(&array_view);
  return NANOARROW_OK;
}