feat(python): Add column-wise buffer builder (#464)

This PR implements building columns buffer-wise for the types where this
makes sense. It also includes a few other changes:

- `item_size` was renamed to `itemsize` to match the memoryview property
  name (see the sketch after this list)
- The visitor methods are now provided by the `ArrayViewVisitable` mixin
  such that they are available on both the `Array` and the `ArrayView`
  without duplicating documentation.
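
A quick illustration of the rename (a hedged sketch: this assumes the
`na.c_buffer()` helper and that the renamed property is exposed on the
resulting buffer object):

```python
import nanoarrow as na

# itemsize now mirrors the memoryview property name; the value shown
# assumes a small buffer of int32 values (4 bytes per element).
buf = na.c_buffer([1, 2, 3], na.int32())
buf.itemsize
#> 4
```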

Functionally this means that the `Array` and `ArrayStream` now have
`to_column()` and `to_column_list()` methods that behave much closer to
what most users would expect.

A quick demo:

```python
import nanoarrow as na
import pyarrow as pa

batch = pa.record_batch({"col1": [1, 2, 3], "col2": ["a", "b", "c"]})
batch_with_nulls = pa.record_batch({"col1": [1, None, 3], "col2": ["a", "b", None]})

# Either builds a buffer or a list depending on column types
na.Array(batch).to_columns_pysequence()
#> (['col1', 'col2'],
#>  [nanoarrow.c_lib.CBuffer(int64[24 b] 1 2 3), ['a', 'b', 'c']])

# One can inject a null handler (a few experimental ones provided)
na.Array(batch_with_nulls).to_columns_pysequence(handle_nulls=na.nulls_as_sentinel())
#> (['col1', 'col2'], [array([ 1., nan,  3.]), ['a', 'b', None]])

# ...by default you have to choose how to do this or we error
na.Array(batch_with_nulls).to_columns_pysequence()
#> ValueError: Null present with null_handler=nulls_forbid()
```
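
Because the names and columns come back as two parallel sequences, they zip
straight into a `{name: column}` mapping (a small sketch reusing the `batch`
defined above; the printed repr is approximate):

```python
# Zip the parallel name/column sequences into a dict keyed by column name
names, columns = na.Array(batch).to_columns_pysequence()
dict(zip(names, columns))
#> {'col1': nanoarrow.c_lib.CBuffer(int64[24 b] 1 2 3), 'col2': ['a', 'b', 'c']}
```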

This will basically get you data frame conversion:

```python
import nanoarrow as na
import pandas as pd

url = "https://github.com/apache/arrow-experiments/raw/main/data/arrow-commits/arrow-commits.arrows"
names, data = na.ArrayStream.from_url(url).to_columns_pysequence(handle_nulls=na.nulls_as_sentinel())
pd.DataFrame({k: v for k, v in zip(names, data)})
#>                                          commit                      time  \
#> 0      49cdb0fe4e98fda19031c864a18e6156c6edbf3c 2024-03-07 02:00:52+00:00   
#> 1      1d966e98e41ce817d1f8c5159c0b9caa4de75816 2024-03-06 21:51:34+00:00   
#> 2      96f26a89bd73997f7532643cdb27d04b70971530 2024-03-06 20:29:15+00:00   
#> 3      ee1a8c39a55f3543a82fed900dadca791f6e9f88 2024-03-06 07:46:45+00:00   
#> 4      3d467ac7bfae03cf2db09807054c5672e1959aec 2024-03-05 16:13:32+00:00   
#> ...                                         ...                       ...   
#> 15482  23c4b08d154f8079806a1f0258d7e4af17bdf5fd 2016-02-17 12:39:03+00:00   
#> 15483  16e44e3d456219c48595142d0a6814c9c950d30c 2016-02-17 12:38:39+00:00   
#> 15484  fa5f0299f046c46e1b2f671e5e3b4f1956522711 2016-02-17 12:38:39+00:00   
#> 15485  cbc56bf8ac423c585c782d5eda5c517ea8df8e3c 2016-02-17 12:38:39+00:00   
#> 15486  d5aa7c46692474376a3c31704cfc4783c86338f2 2016-02-05 20:08:35+00:00   
#> 
#>        files  merge                                            message  
#> 0          2  False  GH-40370: [C++] Define ARROW_FORCE_INLINE for ...  
#> 1          1  False     GH-40386: [Python] Fix except clauses (#40387)  
#> 2          1  False  GH-40227: [R] ensure executable files in `crea...  
#> 3          1  False  GH-40366: [C++] Remove const qualifier from Bu...  
#> 4          1  False  GH-20127: [Python][CI] Remove legacy hdfs test...  
#> ...      ...    ...                                                ...  
#> 15482     73  False  ARROW-4: This provides an partial C++11 implem...  
#> 15483      8  False  ARROW-3: This patch includes a WIP draft speci...  
#> 15484    124  False                 ARROW-1: Initial Arrow Code Commit  
#> 15485      2  False             Update readme and add license in root.  
#> 15486      1  False                                     Initial Commit  
#> 
#> [15487 rows x 5 columns]
```

---------

Co-authored-by: Joris Van den Bossche <jorisvandenbossche@gmail.com>
README.md

nanoarrow

The nanoarrow libraries are a set of helpers to produce and consume Arrow data, including the Arrow C Data, Arrow C Stream, and Arrow C Device structures and the serialized Arrow IPC format. The vision of nanoarrow is that it should be trivial for libraries to produce and consume Arrow data: it helps fulfill this vision by providing high-quality, easy-to-adopt helpers to produce, consume, and test Arrow data types and arrays.

The nanoarrow libraries were built to be:

  • Small: nanoarrow’s C runtime compiles into a few hundred kilobytes and its R and Python bindings both have an installed size of ~1 MB.
  • Easy to depend on: nanoarrow's C library is distributed as two files (nanoarrow.c and nanoarrow.h) and its R and Python bindings have zero dependencies.
  • Useful: The Arrow Columnar Format includes a wide range of data type and data encoding options. To the greatest extent practicable, nanoarrow strives to support the entire Arrow columnar specification (see the Arrow implementation status page for implementation status).

Getting started

The nanoarrow Python bindings are available from PyPI and conda-forge:

pip install nanoarrow
conda install nanoarrow -c conda-forge
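
After installing, a quick sanity check might look like the following (a hedged
sketch: it assumes the na.Array(obj, schema) constructor and the to_pylist()
visitor method are available in the installed version):

```python
import nanoarrow as na

# Build an Array from a Python iterable and convert it back to a list of
# Python objects (the constructor and visitor method named above are
# assumptions about the installed version).
arr = na.Array([1, 2, 3], na.int32())
arr.to_pylist()
#> [1, 2, 3]
```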

The nanoarrow R package is available from CRAN:

install.packages("nanoarrow")

See the nanoarrow Documentation for extended tutorials and API reference for the C, C++, Python, and R libraries.

The nanoarrow GitHub repository additionally provides a number of examples covering how to use nanoarrow in a variety of build configurations.

Development

Building with CMake

CMake is the primary build system used to develop and test the nanoarrow C library. You can build nanoarrow with:

mkdir build && cd build
cmake ..
cmake --build .

Building nanoarrow with tests currently requires Arrow C++. If Arrow C++ is installed via a system package manager like apt, dnf, or brew, the tests can be built with:

mkdir build && cd build
cmake .. -DNANOARROW_BUILD_TESTS=ON
cmake --build .

Tests can be run with ctest.

Building with Meson

CMake is the officially supported build system for nanoarrow. However, the Meson backend is an experimental feature you may also wish to try.

To run the test suite with Meson, you will want to first install the testing dependencies via the wrap database (n.b. no wrap database entry exists for Arrow - that must be installed separately).

mkdir subprojects
meson wrap install gtest
meson wrap install google-benchmark
meson wrap install nlohmann_json

The Arrow C++ library must also be discoverable via pkg-config to build the tests.

You can then set up your build directory:

meson setup builddir
cd builddir

And configure your project (this could also have been done inline with the setup step):

meson configure -DNANOARROW_BUILD_TESTS=true -DNANOARROW_BUILD_BENCHMARKS=true

Note that if your Arrow pkg-config profile is installed in a non-standard location on your system, you may pass --pkg-config-path <path to directory with arrow.pc> to either the setup or configure steps above.

With the above out of the way, the compile command should take care of the rest:

meson compile

Upon a successful build you can execute the test suite and benchmarks with the following commands:

meson test nanoarrow:  # default test run
meson test nanoarrow: --wrap valgrind  # run tests under valgrind
meson test nanoarrow: --benchmark --verbose # run benchmarks