commit | d1b9924fe74fb68096cdc209d0341a1e163b570e | |
---|---|---|
author | Dewey Dunnington <dewey@dunnington.ca> | Fri May 17 16:10:43 2024 -0300 |
committer | GitHub <noreply@github.com> | Fri May 17 16:10:43 2024 -0300 |
tree | f76c024200d2f67cd08729d43d92dba6ecd60a61 | |
parent | aebc81248c0423a535640498e3f26a6de69e7eca | |
feat(python): Add column-wise buffer builder (#464)

This PR implements building columns buffer-wise for the types where this makes sense. It also implements a few other changes:

- `item_size` was renamed to `itemsize` to match the memoryview property name
- The visitor methods are now the `ArrayViewVisitable` mixin such that they are available in both the `Array` and `ArrayView` without duplicating documentation.

Functionally this means that the `Array` and `ArrayStream` now have `to_column()` and `to_column_list()` methods that do something that more closely matches what somebody would expect.

A quick demo:

```python
import nanoarrow as na
import pyarrow as pa

batch = pa.record_batch({"col1": [1, 2, 3], "col2": ["a", "b", "c"]})
batch_with_nulls = pa.record_batch({"col1": [1, None, 3], "col2": ["a", "b", None]})

# Either builds a buffer or a list depending on column types
na.Array(batch).to_columns_pysequence()
#> (['col1', 'col2'],
#>  [nanoarrow.c_lib.CBuffer(int64[24 b] 1 2 3), ['a', 'b', 'c']])

# One can inject a null handler (a few experimental ones provided)
na.Array(batch_with_nulls).to_columns_pysequence(handle_nulls=na.nulls_as_sentinel())
#> (['col1', 'col2'], [array([ 1., nan,  3.]), ['a', 'b', None]])

# ...by default you have to choose how to do this or we error
na.Array(batch_with_nulls).to_columns_pysequence()
#> ValueError: Null present with null_handler=nulls_forbid()
```

This will basically get you data frame conversion:

```python
import nanoarrow as na
import pandas as pd

url = "https://github.com/apache/arrow-experiments/raw/main/data/arrow-commits/arrow-commits.arrows"
names, data = na.ArrayStream.from_url(url).to_columns_pysequence(handle_nulls=na.nulls_as_sentinel())
pd.DataFrame({k: v for k, v in zip(names, data)})
#>                                          commit                      time  \
#> 0      49cdb0fe4e98fda19031c864a18e6156c6edbf3c 2024-03-07 02:00:52+00:00
#> 1      1d966e98e41ce817d1f8c5159c0b9caa4de75816 2024-03-06 21:51:34+00:00
#> 2      96f26a89bd73997f7532643cdb27d04b70971530 2024-03-06 20:29:15+00:00
#> 3      ee1a8c39a55f3543a82fed900dadca791f6e9f88 2024-03-06 07:46:45+00:00
#> 4      3d467ac7bfae03cf2db09807054c5672e1959aec 2024-03-05 16:13:32+00:00
#> ...                                         ...                       ...
#> 15482  23c4b08d154f8079806a1f0258d7e4af17bdf5fd 2016-02-17 12:39:03+00:00
#> 15483  16e44e3d456219c48595142d0a6814c9c950d30c 2016-02-17 12:38:39+00:00
#> 15484  fa5f0299f046c46e1b2f671e5e3b4f1956522711 2016-02-17 12:38:39+00:00
#> 15485  cbc56bf8ac423c585c782d5eda5c517ea8df8e3c 2016-02-17 12:38:39+00:00
#> 15486  d5aa7c46692474376a3c31704cfc4783c86338f2 2016-02-05 20:08:35+00:00
#>
#>        files  merge  message
#> 0          2  False  GH-40370: [C++] Define ARROW_FORCE_INLINE for ...
#> 1          1  False  GH-40386: [Python] Fix except clauses (#40387)
#> 2          1  False  GH-40227: [R] ensure executable files in `crea...
#> 3          1  False  GH-40366: [C++] Remove const qualifier from Bu...
#> 4          1  False  GH-20127: [Python][CI] Remove legacy hdfs test...
#> ...      ...    ...  ...
#> 15482     73  False  ARROW-4: This provides an partial C++11 implem...
#> 15483      8  False  ARROW-3: This patch includes a WIP draft speci...
#> 15484    124  False  ARROW-1: Initial Arrow Code Commit
#> 15485      2  False  Update readme and add license in root.
#> 15486      1  False  Initial Commit
#>
#> [15487 rows x 5 columns]
```

---------

Co-authored-by: Joris Van den Bossche <jorisvandenbossche@gmail.com>
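The demo output above shows the integer column with a null coming back as floats with `nan` in the null slot. The following is only a conceptual sketch of that idea in plain Python, not nanoarrow's implementation: the helper names `validity_bits` and `nulls_as_nan` are hypothetical, and the only Arrow fact relied on is that validity bitmaps store one bit per element, least-significant bit first, with 1 meaning "valid".

```python
import math


def validity_bits(bitmap, length):
    """Unpack an Arrow-style validity bitmap (least-significant bit first)."""
    return [bool((bitmap[i // 8] >> (i % 8)) & 1) for i in range(length)]


def nulls_as_nan(values, bitmap, length):
    """Hypothetical helper: convert to float, using NaN as the null sentinel."""
    valid = validity_bits(bitmap, length)
    return [float(v) if ok else math.nan for v, ok in zip(values, valid)]


# col1 = [1, None, 3]: slots 0 and 2 are valid, so the bitmap byte is 0b00000101.
# The value stored in a null slot is arbitrary; 0 is used here as a placeholder.
result = nulls_as_nan([1, 0, 3], bytes([0b00000101]), 3)
print(result)
```

Because an integer type has no spare value to mark "missing", a sentinel-based handler has to widen the column to a type that does, which is why the result is floating point.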
The nanoarrow libraries are a set of helpers to produce and consume Arrow data, including the Arrow C Data, Arrow C Stream, and Arrow C Device structures and the serialized Arrow IPC format. The vision of nanoarrow is that it should be trivial for libraries to produce and consume Arrow data: it helps fulfill this vision by providing high-quality, easy-to-adopt helpers to produce, consume, and test Arrow data types and arrays.
The nanoarrow libraries were built to be:
The nanoarrow Python bindings are available from PyPI and conda-forge:
pip install nanoarrow
conda install nanoarrow -c conda-forge
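After installing with either command, one quick way to confirm that the package is visible to the current interpreter is a check like the sketch below. The helper name `is_installed` is hypothetical; it only uses the standard library's `importlib.util.find_spec`, which locates a module without importing it.

```python
import importlib.util


def is_installed(module_name):
    """Return True if the module can be located without importing it."""
    return importlib.util.find_spec(module_name) is not None


if is_installed("nanoarrow"):
    print("nanoarrow is available in this environment")
else:
    print("nanoarrow is not installed in this environment")
```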
The nanoarrow R package is available from CRAN:
install.packages("nanoarrow")
See the nanoarrow Documentation for extended tutorials and API reference for the C, C++, Python, and R libraries.
The nanoarrow GitHub repository additionally provides a number of examples covering how to use nanoarrow in a variety of build configurations.
CMake is the primary build system used to develop and test the nanoarrow C library. You can build nanoarrow with:
mkdir build && cd build
cmake ..
cmake --build .
Building nanoarrow with tests currently requires Arrow C++. If installed via a system package manager like `apt`, `dnf`, or `brew`, the tests can be built with:
mkdir build && cd build
cmake .. -DNANOARROW_BUILD_TESTS=ON
cmake --build .
Tests can be run with `ctest`.
CMake is the officially supported build system for nanoarrow. However, the Meson backend is an experimental feature you may also wish to try.
To run the test suite with Meson, you will want to first install the testing dependencies via the wrap database (n.b. no wrap database entry exists for Arrow - that must be installed separately).
mkdir subprojects
meson wrap install gtest
meson wrap install google-benchmark
meson wrap install nlohmann_json
The Arrow C++ library must also be discoverable via pkg-config to build tests.
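Whether Arrow C++ is discoverable can be checked before configuring the build. The sketch below is one way to do that, assuming the pkg-config module is named `arrow` (matching the `arrow.pc` file) and that `pkg-config` is on the `PATH`; the helper name `pkg_config_finds` is hypothetical.

```python
import shutil
import subprocess


def pkg_config_finds(package="arrow"):
    """Return True if `pkg-config --exists <package>` succeeds."""
    exe = shutil.which("pkg-config")
    if exe is None:
        # pkg-config itself is missing, so nothing can be discovered through it
        return False
    return subprocess.run([exe, "--exists", package]).returncode == 0


found = pkg_config_finds("arrow")
print("Arrow C++ visible to pkg-config:", found)
```

If this prints `False` even though Arrow C++ is installed, the `arrow.pc` file is probably in a directory that pkg-config does not search by default; pkg-config also consults the `PKG_CONFIG_PATH` environment variable for additional directories.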
You can then set up your build directory:
meson setup builddir
cd builddir
And configure your project (this could have also been done inline with `setup`):
meson configure -DNANOARROW_BUILD_TESTS=true -DNANOARROW_BUILD_BENCHMARKS=true
Note that if your Arrow pkg-config profile is installed in a non-standard location on your system, you may pass `--pkg-config-path <path to directory with arrow.pc>` to either the setup or configure steps above.
With the above out of the way, the `compile` command should take care of the rest:
meson compile
Upon a successful build you can execute the test suite and benchmarks with the following commands:
meson test nanoarrow: # default test run
meson test nanoarrow: --wrap valgrind # run tests under valgrind
meson test nanoarrow: --benchmark --verbose # run benchmarks