refactor(python): Document, prefix, and add reprs for C-wrapping classes (#340)

This PR was inspired by #319 but only addresses the first half (prefixing
C-wrapping classes so that the name `nanoarrow.array()` can be used for
a future class/constructor that more closely resembles a `pyarrow.Array`
or `numpy.ndarray`).

This PR does a few things:

- Uses capsules to manage allocation/cleanup of C resources instead of
"holder" objects. This eliminated some code and, in theory, makes it
possible to move some pieces out of Cython into C.
- Renames any "nanoarrow C library binding" classes to start with `C`
(e.g., `Schema` to `CSchema`). I made them slightly more literal as
well. Basically, these classes are about accessing the fields of the
structure without segfaulting. In a potential future world where we
don't use Cython, this is something like what we'd get with
auto-generated wrapper classes or thin C++ wrappers with generated
binding code.
- Opens the door for the user-facing versions of these: `Array`,
`Schema`, and an `ArrayStream`. The scope and design of those require
more iteration than this PR allows and would benefit from some other
infrastructure being in place first (e.g., conversion to/from Python).

To make it a little more clear what the existing structures actually are
and what they can do, I added `repr()`s for them and updated the README.
Briefly:

```python
import nanoarrow as na
import pyarrow as pa

na.cschema(pa.int32())
#> <nanoarrow.clib.CSchema int32>
#> - format: 'i'
#> - name: ''
#> - flags: 2
#> - metadata: NULL
#> - dictionary: NULL
#> - children[0]:

na.cschema_view(pa.timestamp('s', "America/Halifax"))
#> <nanoarrow.clib.CSchemaView>
#> - type: 'timestamp'
#> - storage_type: 'int64'
#> - time_unit: 's'
#> - timezone: 'America/Halifax'

na.carray(pa.array([1, 2, 3]))
#> <nanoarrow.clib.CArray int64>
#> - length: 3
#> - offset: 0
#> - null_count: 0
#> - buffers: (0, 3354772373824)
#> - dictionary: NULL
#> - children[0]:

na.carray_view(pa.array([1, 2, 3]))
#> <nanoarrow.clib.CArrayView>
#> - storage_type: 'int64'
#> - length: 3
#> - offset: 0
#> - null_count: 0
#> - buffers[2]:
#>   - <bool validity[0 b] >
#>   - <int64 data[24 b] 1 2 3>
#> - dictionary: NULL
#> - children[0]:

pa_array_child = pa.array([1, 2, 3], pa.int32())
pa_array = pa.record_batch([pa_array_child], names=["some_column"])
reader = pa.RecordBatchReader.from_batches(pa_array.schema, [pa_array])
na.carray_stream(reader)
#> <nanoarrow.clib.CArrayStream>
#> - get_schema(): struct<some_column: int32>
```
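
The `CSchemaView` output above is essentially a parsed form of the schema's format string. As a rough illustration (not nanoarrow's parser, which is implemented in C and covers far more types), the sketch below decodes a small subset of Arrow C Data Interface format strings. The function name `parse_format` and the returned dictionary keys are assumptions chosen to mirror the repr fields.

```python
# Subset of single-character format strings from the Arrow C Data Interface.
_PRIMITIVE_FORMATS = {
    "n": "null",
    "b": "bool",
    "c": "int8",
    "C": "uint8",
    "s": "int16",
    "S": "uint16",
    "i": "int32",
    "I": "uint32",
    "l": "int64",
    "L": "uint64",
    "f": "float",
    "g": "double",
    "u": "string",
    "z": "binary",
}

# Third character of a "ts?:timezone" timestamp format string.
_TIME_UNITS = {"s": "s", "m": "ms", "u": "us", "n": "ns"}


def parse_format(fmt):
    """Describe a format string as a dict (hypothetical helper)."""
    # Timestamps look like "tss:America/Halifax": "ts", unit, ":", timezone.
    if len(fmt) >= 4 and fmt.startswith("ts") and fmt[3] == ":":
        if fmt[2] not in _TIME_UNITS:
            raise ValueError(f"unknown time unit in format string: {fmt!r}")
        return {
            "type": "timestamp",
            "storage_type": "int64",
            "time_unit": _TIME_UNITS[fmt[2]],
            "timezone": fmt[4:],
        }

    if fmt in _PRIMITIVE_FORMATS:
        name = _PRIMITIVE_FORMATS[fmt]
        return {"type": name, "storage_type": name}

    raise ValueError(f"unsupported format string: {fmt!r}")


for key, value in parse_format("tss:America/Halifax").items():
    print(f"- {key}: {value!r}")
```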

This involved fixing the existing `BufferView`, since its elements had
to be accessed in order to print buffer contents in a repr-friendly way.
I think the `BufferView` will see some changes, but it does seem
relatively performant:

```python
import pyarrow as pa
import nanoarrow as na
import numpy as np

n = int(1e6)
pa_array = pa.array(np.random.random(n))
na_array_view = na.carray_view(pa_array)

%timeit pa_array.to_pylist()
#> 169 ms ± 1.35 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit list(na_array_view.buffer(1))
#> 33.8 ms ± 340 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
```
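
For readers unfamiliar with what "accessing the elements" of a buffer involves, the following is a minimal pure-Python sketch of an element-wise buffer view built on `memoryview.cast()`. It is an illustration only: `BufferViewSketch` is a hypothetical class, and the actual `BufferView` is implemented in Cython and may work quite differently.

```python
import struct


class BufferViewSketch:
    """Hypothetical element-wise view over a bytes-like data buffer."""

    def __init__(self, data, type_name, format_char):
        self._raw = memoryview(data)
        # Reinterpret the raw bytes as fixed-width items without copying.
        self._items = self._raw.cast(format_char)
        self._type_name = type_name

    def __len__(self):
        return len(self._items)

    def __getitem__(self, i):
        return self._items[i]

    def __iter__(self):
        return iter(self._items)

    def __repr__(self):
        elements = " ".join(str(item) for item in self)
        return f"<{self._type_name} data[{self._raw.nbytes} b] {elements}>"


# Three int64 values in native byte order: 24 bytes, as in the repr above.
data = struct.pack("=3q", 1, 2, 3)
view = BufferViewSketch(data, "int64", "q")
print(repr(view))
print(list(view))
```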

---------

Co-authored-by: Dane Pitkin <48041712+danepitkin@users.noreply.github.com>
13 files changed
tree: f4304f01896064e57103e4d23b362c72e2e29557
README.md

nanoarrow

The nanoarrow library is a set of helper functions to interpret and generate Arrow C Data Interface and Arrow C Stream Interface structures. The library is in active early development and users should update regularly from the main branch of this repository.

Whereas the current suite of Arrow implementations provide the basis for a comprehensive data analysis toolkit, this library is intended to support clients that wish to produce or interpret Arrow C Data and/or Arrow C Stream structures where linking to a higher level Arrow binding is difficult or impossible.

Using the C library

The nanoarrow C library is intended to be copied and vendored. This can be done using CMake or by using the bundled nanoarrow.h/nanoarrow.c distribution available in the dist/ directory in this repository. Examples of both can be found in the examples/ directory in this repository.

A simple producer example:

#include "nanoarrow.h"

int make_simple_array(struct ArrowArray* array_out, struct ArrowSchema* schema_out) {
  struct ArrowError error;
  array_out->release = NULL;
  schema_out->release = NULL;

  NANOARROW_RETURN_NOT_OK(ArrowArrayInitFromType(array_out, NANOARROW_TYPE_INT32));

  NANOARROW_RETURN_NOT_OK(ArrowArrayStartAppending(array_out));
  NANOARROW_RETURN_NOT_OK(ArrowArrayAppendInt(array_out, 1));
  NANOARROW_RETURN_NOT_OK(ArrowArrayAppendInt(array_out, 2));
  NANOARROW_RETURN_NOT_OK(ArrowArrayAppendInt(array_out, 3));
  NANOARROW_RETURN_NOT_OK(ArrowArrayFinishBuildingDefault(array_out, &error));

  NANOARROW_RETURN_NOT_OK(ArrowSchemaInitFromType(schema_out, NANOARROW_TYPE_INT32));

  return NANOARROW_OK;
}

A simple consumer example:

#include <errno.h>
#include <stdio.h>

#include "nanoarrow.h"

int print_simple_array(struct ArrowArray* array, struct ArrowSchema* schema) {
  struct ArrowError error;
  struct ArrowArrayView array_view;
  NANOARROW_RETURN_NOT_OK(ArrowArrayViewInitFromSchema(&array_view, schema, &error));

  // Only int32 storage is printed below, so stop early for anything else
  if (array_view.storage_type != NANOARROW_TYPE_INT32) {
    printf("Array has storage that is not int32\n");
    ArrowArrayViewReset(&array_view);
    return EINVAL;
  }

  // Point the view at the array; release the view's resources on failure
  int result = ArrowArrayViewSetArray(&array_view, array, &error);
  if (result != NANOARROW_OK) {
    ArrowArrayViewReset(&array_view);
    return result;
  }

  for (int64_t i = 0; i < array->length; i++) {
    printf("%d\n", (int)ArrowArrayViewGetIntUnsafe(&array_view, i));
  }

  ArrowArrayViewReset(&array_view);
  return NANOARROW_OK;
}