tree f4304f01896064e57103e4d23b362c72e2e29557
parent b636a8f88d6a529ff31da6847ac44b584847e5f6
author Dewey Dunnington <dewey@dunnington.ca> 1705000644 +0000
committer GitHub <noreply@github.com> 1705000644 -0400
gpgsig -----BEGIN PGP SIGNATURE-----
 
 wsBcBAABCAAQBQJloD7ECRBK7hj4Ov3rIwAAARAIAGHRmjF4d0WpbME9Kpeuoxx9
 +KYiG27kYaMCjQ9G5+dVDJ9IO2rQxyZWOyQ3/0tE03CELnf9d+UH2IYBlfYSnAHA
 +Do8/mwGPhxk0L4unLx2tGdb0Nzwkb9h4zlQvICc0JhwAM71VfHs7BLPUN4NJC+H
 C8+KM9KKftiJGtkwUBmDdohivIBPA1uND2Xq0IMSS15ozsgWHFnko/uhoK1i19fs
 mH+9PxjFZbK51AaupK0AmgiYfM3fWK3iJLO/NB+u4eDb3NmCq1V9Z7MmVAuhWu5b
 dKZ/Vr21tMsHJmosTOTjUUC76sIt7kZhFrmuSpZPbTpR9qXPmUCTcl0LsTpEHsU=
 =U+Bi
 -----END PGP SIGNATURE-----
 

refactor(python): Document, prefix, and add reprs for C-wrapping classes (#340)

This PR was inspired by #319 but only addresses the first half: it
prefixes C-wrapping classes so that the name `nanoarrow.array()` can be
used for a future class/constructor that more closely resembles a
`pyarrow.Array` or `numpy.ndarray`.

This PR does a few things:

- Uses capsules to manage allocation/cleanup of C resources instead of
"holder" objects. This eliminated some code and, in theory, makes it
possible to move some pieces out of Cython into C.
- Renames any "nanoarrow C library binding" classes to start with `C`
(e.g., `Schema` to `CSchema`). I made them slightly more literal as
well. Basically, these classes are about accessing the fields of the
structure without segfaulting. In a potential future world where we
don't use Cython, this is something like what we'd get with
auto-generated wrapper classes or thin C++ wrappers with generated
binding code.
- Opens the door for the user-facing versions of these: `Array`,
`Schema`, and an `ArrayStream`. The scope and design of those require
more iteration than this PR allows and would benefit from some other
infrastructure being in place first (e.g., convert to/from Python).

To make it clearer what the existing structures are and what they can
do, I added `repr()`s for them and updated the README. Briefly:

```python
import nanoarrow as na
import pyarrow as pa

na.cschema(pa.int32())
#> <nanoarrow.clib.CSchema int32>
#> - format: 'i'
#> - name: ''
#> - flags: 2
#> - metadata: NULL
#> - dictionary: NULL
#> - children[0]:

na.cschema_view(pa.timestamp('s', "America/Halifax"))
#> <nanoarrow.clib.CSchemaView>
#> - type: 'timestamp'
#> - storage_type: 'int64'
#> - time_unit: 's'
#> - timezone: 'America/Halifax'

na.carray(pa.array([1, 2, 3]))
#> <nanoarrow.clib.CArray int64>
#> - length: 3
#> - offset: 0
#> - null_count: 0
#> - buffers: (0, 3354772373824)
#> - dictionary: NULL
#> - children[0]:

na.carray_view(pa.array([1, 2, 3]))
#> <nanoarrow.clib.CArrayView>
#> - storage_type: 'int64'
#> - length: 3
#> - offset: 0
#> - null_count: 0
#> - buffers[2]:
#>   - <bool validity[0 b] >
#>   - <int64 data[24 b] 1 2 3>
#> - dictionary: NULL
#> - children[0]:

pa_array_child = pa.array([1, 2, 3], pa.int32())
pa_array = pa.record_batch([pa_array_child], names=["some_column"])
reader = pa.RecordBatchReader.from_batches(pa_array.schema, [pa_array])
na.carray_stream(reader)
#> <nanoarrow.clib.CArrayStream>
#> - get_schema(): struct<some_column: int32>
```

This involved fixing the existing `BufferView`: printing its contents
in a repr-friendly way requires accessing the elements. I think the
`BufferView` will see some changes, but it does seem relatively
performant:

```python
import pyarrow as pa
import nanoarrow as na
import numpy as np

n = int(1e6)
pa_array = pa.array(np.random.random(n))
na_array_view = na.carray_view(pa_array)

%timeit pa_array.to_pylist()
#> 169 ms ± 1.35 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit list(na_array_view.buffer(1))
#> 33.8 ms ± 340 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
```

---------

Co-authored-by: Dane Pitkin <48041712+danepitkin@users.noreply.github.com>