commit	841c845547d7abb7eb11aa01a651175309293c03
author	Dewey Dunnington <dewey@dunnington.ca>	Mon Feb 19 15:01:38 2024 -0400
committer	GitHub <noreply@github.com>	Mon Feb 19 15:01:38 2024 -0400
tree	71345efb18a32f5e7073983f495d6a2d5a0d1bf4
parent	4b6717fc7e0161a366207ead5e9a30fbeca7fada

feat(python): Add array creation/building from buffers (#378)

The gist of this PR is that I'd like the ability to create arrays for
testing without pyarrow so that nanoarrow's tests can run in more
places. Other than building/running in odd corner-case environments,
nanoarrow in R has been great at prototyping and/or creating test data
(e.g., an array with a non-zero offset, an array with a rarely-used
type). This is useful for both nanoarrow to test itself and perhaps
others who might want to use nanoarrow in a similar way in Python.

This is a bit big...I did need to put all of it in one place to figure
out what the end point was; however, I'm happy to split into smaller
self-contained bits now that I know where I'm headed.

After this PR, we can create an array out-of-the-box from anything that
supports the buffer protocol. Importantly, this includes numpy arrays so
that you can do things like generate arrays with `n` random numbers.


```python
import nanoarrow as na
import numpy as np
```

```python
na.c_array_view(b"12345")
```




    <nanoarrow.c_lib.CArrayView>
    - storage_type: 'uint8'
    - length: 5
    - offset: 0
    - null_count: 0
    - buffers[2]:
      - validity <bool[0 b] >
      - data <uint8[5 b] 49 50 51 52 53>
    - dictionary: NULL
    - children[0]:


```python
na.c_array_view(np.array([1, 2, 3], np.int32))
```

```
<nanoarrow.c_lib.CArrayView>
- storage_type: 'int32'
- length: 3
- offset: 0
- null_count: 0
- buffers[2]:
  - validity <bool[0 b] >
  - data <int32[12 b] 1 2 3>
- dictionary: NULL
- children[0]:
```
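
For readers without nanoarrow installed, the metadata a buffer-protocol object exposes can be inspected with the standard library's `memoryview`. This is only a sketch of what a consumer gets to see, not nanoarrow's implementation; `describe_buffer` is a hypothetical helper name.

```python
import array


def describe_buffer(obj):
    """Return (format, itemsize, nbytes, values) for a buffer-protocol object."""
    with memoryview(obj) as view:
        return view.format, view.itemsize, view.nbytes, view.tolist()


# bytes export unsigned 8-bit items, matching the 'uint8' storage type above
print(describe_buffer(b"12345"))  # ('B', 1, 5, [49, 50, 51, 52, 53])

# array.array("i", ...) exports C ints (typically 4 bytes each)
print(describe_buffer(array.array("i", [1, 2, 3])))
```

The format string and item size are what a consumer can use to pick a matching Arrow storage type.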

While not built in to the main `c_array()` constructor, we can also now
assemble an array from buffers. This has been very useful in R and
ensures that we can construct just about any array if we need to.


```python
array = na.c_array_from_buffers(
    na.struct([na.int32()]),
    length=3,
    buffers=[None],
    children=[
        na.c_array_from_buffers(
            na.int32(),
            length=3,
            buffers=[None, na.c_buffer([1, 2, 3], na.int32())]
        )
    ],
)

na.c_array_view(array)
```




    <nanoarrow.c_lib.CArrayView>
    - storage_type: 'struct'
    - length: 3
    - offset: 0
    - null_count: 0
    - buffers[1]:
      - validity <bool[0 b] >
    - dictionary: NULL
    - children[1]:
      - <nanoarrow.c_lib.CArrayView>
        - storage_type: 'int32'
        - length: 3
        - offset: 0
        - null_count: 0
        - buffers[2]:
          - validity <bool[0 b] >
          - data <int32[12 b] 1 2 3>
        - dictionary: NULL
        - children[0]:



I also added the ability to construct a buffer from an iterable and
wired that into the `c_array()` constructor, although this is probably
not all that fast. It does, however, make it much easier to write tests
(because many of them currently start with
`na.c_array(pa.array([1, 2, 3]))`).


```python
na.c_array_view([1, 2, 3], na.int32())
```




    <nanoarrow.c_lib.CArrayView>
    - storage_type: 'int32'
    - length: 3
    - offset: 0
    - null_count: 0
    - buffers[2]:
      - validity <bool[0 b] >
      - data <int32[12 b] 1 2 3>
    - dictionary: NULL
    - children[0]:



This allows creating an array from anything supported by the `struct`
module which means we can create some of the less frequently used types.


```python
na.c_buffer([1, 2, 3], na.float16())
```




    CBuffer(half_float[6 b] 1.0 2.0 3.0)




```python
na.c_buffer([(1, 2), (3, 4), (5, 6)], na.interval_day_time())
```




    CBuffer(interval_day_time[24 b] (1, 2) (3, 4) (5, 6))
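
The byte counts in the two outputs above can be reproduced with `struct` alone. This is a sketch under assumptions: `e` is the `struct` format for a half float, and a day-time interval is packed as two 32-bit integers (days, milliseconds), which is Arrow's layout. The exact format strings nanoarrow uses internally are not shown here.

```python
import struct

# Three half floats: 2 bytes each, 6 bytes total
half_floats = struct.pack("=3e", 1.0, 2.0, 3.0)
print(len(half_floats))  # 6

# Three (days, milliseconds) pairs: 8 bytes each, 24 bytes total
pairs = [(1, 2), (3, 4), (5, 6)]
intervals = b"".join(struct.pack("=ii", days, ms) for days, ms in pairs)
print(len(intervals))  # 24

# Both round-trip through struct unchanged
print(struct.unpack("=3e", half_floats))
print(list(struct.iter_unpack("=ii", intervals)))
```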



Because it's mentally exhausting to bitpack buffers in my head and
because Arrow uses them all the time, I also think it's mission-critical
to be able to create bitmaps:


```python
na.c_buffer([True, False, True, True], na.bool())
```




    CBuffer(bool[1 b] 10110000)
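
For reference, the packing itself is small. This is a pure-Python sketch of least-significant-bit-first packing as Arrow bitmaps use it, not the code added in this PR; `pack_bits` and `show_bits` are hypothetical names.

```python
def pack_bits(values):
    """Pack booleans into bytes, least-significant bit first."""
    values = list(values)
    out = bytearray((len(values) + 7) // 8)
    for i, value in enumerate(values):
        if value:
            out[i // 8] |= 1 << (i % 8)
    return bytes(out)


def show_bits(packed):
    """Render each byte with element 0 on the left, like the repr above."""
    return " ".join(format(byte, "08b")[::-1] for byte in packed)


packed = pack_bits([True, False, True, True])
print(packed[0])          # 13, i.e. 0b00001101
print(show_bits(packed))  # 10110000
```

The repr reads left to right in element order, which is why it looks reversed next to the usual binary literal for the same byte.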


This involved fixing some issues with the existing buffer view:

- The buffer view only ever saved a pointer to the device. This is a bit
of a problem because even though the CPU device is static and lives
forever, CUDA "device" objects will probably keep a CUDA context alive.
Thus, we need a strong reference to the `CDevice` Python object (which
ensures the underlying nanoarrow `Device*` remains valid).
- The buffer view only handled `BufferView` input where technically all
it needs is a pointer and a length. This opens it up to represent other
types of buffers than just something from nanoarrow (e.g., imported from
dlpack or buffer protocol).

Implementing the buffer protocol as a consumer was done by wrapping the
`ArrowBuffer` with a "deallocator" that holds the `Py_buffer` and
ensures it is released. I still need to do some testing to ensure that
it's actually released and that we're not leaking memory. This is how I
do it in R and in geoarrow-c (Python) as well. Using the `ArrowBuffer`
is helpful because the C-level array builder uses them to manage the
memory and ensures they're all released when the array is released.
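
The ownership idea can be sketched in pure Python, with `memoryview` standing in for the `Py_buffer` and an explicit `release()` standing in for the deallocator. `HeldBuffer` and `has_live_export` are hypothetical names, not part of nanoarrow. A `bytearray` refuses to resize while something holds an export of it, which gives a cheap way to check that the release actually happened.

```python
class HeldBuffer:
    """Stand-in for a buffer whose deallocator owns an exported view."""

    def __init__(self, obj):
        self._view = memoryview(obj)  # keeps the exporter's memory pinned
        self.size_bytes = self._view.nbytes

    def release(self):
        # Safe to call more than once, like a release callback should be
        if self._view is not None:
            self._view.release()
            self._view = None


def has_live_export(source):
    """Return True if a bytearray still has an outstanding buffer export."""
    try:
        source.append(0)  # resizing fails while an export is alive
    except BufferError:
        return True
    source.pop()
    return False


source = bytearray(b"12345")
held = HeldBuffer(source)
print(has_live_export(source))  # True
held.release()
print(has_live_export(source))  # False
```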

Implementing the build-from-iterable involved a few more
things...notably, completing the "python struct format string" <->
"arrow data type" conversion. This allows the use of `struct.pack()`
which takes care of things like half-float conversion and tuples of day,
month, nano conversion.
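
A two-way lookup of that kind might look like the following. The table is a hypothetical subset for illustration; the type names are labels chosen for this sketch and the mapping is not copied from nanoarrow's source.

```python
import struct

# Hypothetical subset: struct format character -> type name
FORMAT_TO_TYPE = {
    "b": "int8", "B": "uint8",
    "h": "int16", "H": "uint16",
    "i": "int32", "I": "uint32",
    "q": "int64", "Q": "uint64",
    "e": "half_float", "f": "float", "d": "double",
}
TYPE_TO_FORMAT = {name: fmt for fmt, name in FORMAT_TO_TYPE.items()}


def pack_values(values, type_name):
    """Pack an iterable of scalars using the format for type_name."""
    fmt = "=" + TYPE_TO_FORMAT[type_name]  # "=": native order, standard sizes
    return b"".join(struct.pack(fmt, value) for value in values)


print(len(pack_values([1, 2, 3], "int32")))       # 12
print(len(pack_values([1, 2, 3], "half_float")))  # 6
```

Packing one element at a time keeps the sketch simple and works for any iterable, at the cost of speed.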

I'm aware this could use a bit better documentation of the added
classes/methods...I am assuming these will be internal for the time
being but they definitely need a bit more than is currently there.

---------

Co-authored-by: Joris Van den Bossche <jorisvandenbossche@gmail.com>

17 files changed

README.md

nanoarrow

The nanoarrow library is a set of helper functions to interpret and generate Arrow C Data Interface and Arrow C Stream Interface structures. The library is in active early development and users should update regularly from the main branch of this repository.

Whereas the current suite of Arrow implementations provide the basis for a comprehensive data analysis toolkit, this library is intended to support clients that wish to produce or interpret Arrow C Data and/or Arrow C Stream structures where linking to a higher level Arrow binding is difficult or impossible.

Using the C library

The nanoarrow C library is intended to be copied and vendored. This can be done using CMake or by using the bundled nanoarrow.h/nanoarrow.c distribution available in the dist/ directory in this repository. Examples of both can be found in the examples/ directory in this repository.

A simple producer example:

#include "nanoarrow.h"

// Build the int32 array [1, 2, 3] and a matching schema.
int make_simple_array(struct ArrowArray* array_out, struct ArrowSchema* schema_out) {
  struct ArrowError error;
  // A NULL release callback marks a structure as released (not populated)
  array_out->release = NULL;
  schema_out->release = NULL;

  NANOARROW_RETURN_NOT_OK(ArrowArrayInitFromType(array_out, NANOARROW_TYPE_INT32));

  // Append three values, then finalize the buffers into array_out
  NANOARROW_RETURN_NOT_OK(ArrowArrayStartAppending(array_out));
  NANOARROW_RETURN_NOT_OK(ArrowArrayAppendInt(array_out, 1));
  NANOARROW_RETURN_NOT_OK(ArrowArrayAppendInt(array_out, 2));
  NANOARROW_RETURN_NOT_OK(ArrowArrayAppendInt(array_out, 3));
  NANOARROW_RETURN_NOT_OK(ArrowArrayFinishBuildingDefault(array_out, &error));

  NANOARROW_RETURN_NOT_OK(ArrowSchemaInitFromType(schema_out, NANOARROW_TYPE_INT32));

  return NANOARROW_OK;
}

A simple consumer example:

#include <stdio.h>

#include "nanoarrow.h"

// Print each element of an integer array, one per line.
int print_simple_array(struct ArrowArray* array, struct ArrowSchema* schema) {
  struct ArrowError error;
  struct ArrowArrayView array_view;
  NANOARROW_RETURN_NOT_OK(ArrowArrayViewInitFromSchema(&array_view, schema, &error));

  // Warn (but continue) if the storage type is not what we expect
  if (array_view.storage_type != NANOARROW_TYPE_INT32) {
    printf("Array has storage that is not int32\n");
  }

  // Point the view at the array; release the view if that fails
  int result = ArrowArrayViewSetArray(&array_view, array, &error);
  if (result != NANOARROW_OK) {
    ArrowArrayViewReset(&array_view);
    return result;
  }

  for (int64_t i = 0; i < array->length; i++) {
    printf("%d\n", (int)ArrowArrayViewGetIntUnsafe(&array_view, i));
  }

  ArrowArrayViewReset(&array_view);
  return NANOARROW_OK;
}