chore(dev/benchmarks): Benchmark IPC reader (#405)

This PR adds IPC reader benchmarks that go through the ArrowArrayStream
implementation, which is the most common way the reader is used. Getting
this to work required a little CMake shuffling plus some code to generate
IPC stream fixtures. The current benchmarks are:

- Many batches that are very tiny (from disk)
- Many batches that are very tiny (from a buffer)
- One very wide batch

Benchmark output in details:

<details>

# Benchmark Report

## Configurations

These benchmarks were run with the following configurations:

| preset_name | preset_description                               |
|:------------|:-------------------------------------------------|
| local       | Uses the nanoarrow C sources from this checkout. |
| v0.4.0      | Uses the nanoarrow C sources from the 0.4.0 release. |

## Summary

A quick and dirty summary of benchmark results between this checkout and
the last released version.

| benchmark_label | v0.4.0 | local | change | pct_change |
|:--------------------------------------------------------------|---------:|---------:|---------:|-----------:|
| [ArrayAppendInt16](#arrayappendint16) | 2.67ms | 2.67ms | 3.03µs | 0.1% |
| [ArrayAppendInt32](#arrayappendint32) | 3.13ms | 3.14ms | 18.1µs | 0.6% |
| [ArrayAppendInt64](#arrayappendint64) | 3.86ms | 3.44ms | 1ns | -10.7% |
| [ArrayAppendInt8](#arrayappendint8) | 2.4ms | 2.41ms | 9.04µs | 0.4% |
| [ArrayAppendNulls](#arrayappendnulls) | 12.14ms | 12.13ms | 1ns | -0.1% |
| [ArrayAppendString](#arrayappendstring) | 8.44ms | 8.79ms | 345.74µs | 4.1% |
| [ArrayViewGetInt16](#arrayviewgetint16) | 633.71µs | 631µs | 1ns | -0.4% |
| [ArrayViewGetInt32](#arrayviewgetint32) | 627.51µs | 633.47µs | 5.96µs | 1% |
| [ArrayViewGetInt64](#arrayviewgetint64) | 677.86µs | 680.66µs | 2.8µs | 0.4% |
| [ArrayViewGetInt8](#arrayviewgetint8) | 945.2µs | 943.48µs | 1ns | -0.2% |
| [ArrayViewGetString](#arrayviewgetstring) | 1.25ms | 1.26ms | 3.48µs | 0.3% |
| [ArrayViewIsNull](#arrayviewisnull) | 1.2ms | 1.19ms | 1ns | -0.4% |
| [ArrayViewIsNullNonNullable](#arrayviewisnullnonnullable) | 952.93µs | 941.49µs | 1ns | -1.2% |
| [IpcReadManyBatchesFromBuffer](#ipcreadmanybatchesfrombuffer) | 6.1ms | 5.47ms | 1ns | -10.4% |
| [IpcReadManyBatchesFromFile](#ipcreadmanybatchesfromfile) | 6.99ms | 6.28ms | 1ns | -10.2% |
| [IpcReadManyColumnsFromFile](#ipcreadmanycolumnsfromfile) | 8.55ms | 8.67ms | 117.82µs | 1.4% |
| [SchemaInitWideStruct](#schemainitwidestruct) | 1.02ms | 1.02ms | 4.79µs | 0.5% |
| [SchemaViewInitWideStruct](#schemaviewinitwidestruct) | 104.11µs | 104.31µs | 198.06ns | 0.2% |
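
The pct_change column appears to be the relative difference between the
local and v0.4.0 timings. A minimal sketch of that arithmetic, assuming
pct_change = (local - v0.4.0) / v0.4.0 * 100; the helper names are
hypothetical and are not part of the benchmark tooling. Because the table
shows rounded timings, values recomputed from it can differ from the
reported ones in the last digit.

```c
#include <stdio.h>

/* Hypothetical helper: absolute change between two timings in the same unit.
   Negative means the local checkout was faster than the baseline. */
double bench_change(double baseline, double local) { return local - baseline; }

/* Hypothetical helper: relative change in percent of the baseline timing. */
double bench_pct_change(double baseline, double local) {
  return (local - baseline) / baseline * 100.0;
}

/* Print one summary line from two timings given in milliseconds. */
void bench_print_row(const char* label, double baseline_ms, double local_ms) {
  printf("%s: %.2fms -> %.2fms (%+.2fms, %+.1f%%)\n", label, baseline_ms, local_ms,
         bench_change(baseline_ms, local_ms),
         bench_pct_change(baseline_ms, local_ms));
}
```

For example, `bench_pct_change(6.99, 6.28)` is roughly -10.2, in line with
the IpcReadManyBatchesFromFile row.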

## ArrowArray-related benchmarks

Benchmarks for producing ArrowArrays using the ArrowArrayXXX()
functions.

### ArrayAppendString

Use ArrowArrayAppendString() to build a string array.

[View
Source](https://github.com/paleolimbot/arrow-nanoarrow/blob/benchmark-ipc/dev/benchmarks/c/array_benchmark.cc#L289-L314)

| preset_name | iterations | real_time | cpu_time | items_per_second |
|:------------|-----------:|----------:|---------:|-----------------:|
| local       |         81 |    8.79ms |   8.76ms |      114,185,666 |
| v0.4.0      |         81 |    8.44ms |   8.41ms |      118,902,224 |

### ArrayAppendInt8

Use ArrowArrayAppendInt() to build an int8 array.

[View
Source](https://github.com/paleolimbot/arrow-nanoarrow/blob/benchmark-ipc/dev/benchmarks/c/array_benchmark.cc#L338-L340)

| preset_name | iterations | real_time | cpu_time | items_per_second |
|:------------|-----------:|----------:|---------:|-----------------:|
| local       |        294 |    2.41ms |    2.4ms |      415,884,526 |
| v0.4.0      |        291 |     2.4ms |    2.4ms |      417,592,258 |

### ArrayAppendInt16

Use ArrowArrayAppendInt() to build an int16 array.

[View
Source](https://github.com/paleolimbot/arrow-nanoarrow/blob/benchmark-ipc/dev/benchmarks/c/array_benchmark.cc#L343-L345)

| preset_name | iterations | real_time | cpu_time | items_per_second |
|:------------|-----------:|----------:|---------:|-----------------:|
| local       |        262 |    2.67ms |   2.67ms |      375,204,429 |
| v0.4.0      |        260 |    2.67ms |   2.66ms |      375,796,399 |

### ArrayAppendInt32

Use ArrowArrayAppendInt() to build an int32 array.

[View
Source](https://github.com/paleolimbot/arrow-nanoarrow/blob/benchmark-ipc/dev/benchmarks/c/array_benchmark.cc#L348-L350)

| preset_name | iterations | real_time | cpu_time | items_per_second |
|:------------|-----------:|----------:|---------:|-----------------:|
| local       |        224 |    3.14ms |   3.14ms |      318,552,407 |
| v0.4.0      |        223 |    3.13ms |   3.12ms |      320,376,521 |

### ArrayAppendInt64

Use ArrowArrayAppendInt() to build an int64 array.

[View
Source](https://github.com/paleolimbot/arrow-nanoarrow/blob/benchmark-ipc/dev/benchmarks/c/array_benchmark.cc#L353-L355)

| preset_name | iterations | real_time | cpu_time | items_per_second |
|:------------|-----------:|----------:|---------:|-----------------:|
| local       |        224 |    3.44ms |   3.43ms |      291,583,520 |
| v0.4.0      |        194 |    3.86ms |   3.82ms |      261,618,420 |

### ArrayAppendNulls

Use ArrowArrayAppendNulls() to build an int32 array that contains 80%
null values.

[View
Source](https://github.com/paleolimbot/arrow-nanoarrow/blob/benchmark-ipc/dev/benchmarks/c/array_benchmark.cc#L378-L400)

| preset_name | iterations | real_time | cpu_time | items_per_second |
|:------------|-----------:|----------:|---------:|-----------------:|
| local       |         57 |    12.1ms |   12.1ms |       82,528,559 |
| v0.4.0      |         57 |    12.1ms |   12.1ms |       82,489,624 |

## ArrowArrayView-related benchmarks

Benchmarks for consuming ArrowArrays using the ArrowArrayViewXXX()
functions.

### ArrayViewGetInt8

Use ArrowArrayViewGet() to consume an int8 array.

[View
Source](https://github.com/paleolimbot/arrow-nanoarrow/blob/benchmark-ipc/dev/benchmarks/c/array_benchmark.cc#L122-L124)

| preset_name | iterations | real_time | cpu_time | items_per_second |
|:------------|-----------:|----------:|---------:|-----------------:|
| local       |        742 |     943µs |    943µs |    1,060,960,927 |
| v0.4.0      |        745 |     945µs |    943µs |    1,059,924,880 |

### ArrayViewGetInt16

Use ArrowArrayViewGet() to consume an int16 array.

[View
Source](https://github.com/paleolimbot/arrow-nanoarrow/blob/benchmark-ipc/dev/benchmarks/c/array_benchmark.cc#L127-L129)

| preset_name | iterations | real_time | cpu_time | items_per_second |
|:------------|-----------:|----------:|---------:|-----------------:|
| local       |       1115 |     631µs |    630µs |    1,586,526,901 |
| v0.4.0      |       1115 |     634µs |    633µs |    1,580,582,789 |

### ArrayViewGetInt32

Use ArrowArrayViewGet() to consume an int32 array.

[View
Source](https://github.com/paleolimbot/arrow-nanoarrow/blob/benchmark-ipc/dev/benchmarks/c/array_benchmark.cc#L132-L134)

| preset_name | iterations | real_time | cpu_time | items_per_second |
|:------------|-----------:|----------:|---------:|-----------------:|
| local       |       1111 |     633µs |    633µs |    1,580,851,070 |
| v0.4.0      |       1115 |     628µs |    627µs |    1,596,033,249 |

### ArrayViewGetInt64

Use ArrowArrayViewGet() to consume an int64 array.

[View
Source](https://github.com/paleolimbot/arrow-nanoarrow/blob/benchmark-ipc/dev/benchmarks/c/array_benchmark.cc#L137-L139)

| preset_name | iterations | real_time | cpu_time | items_per_second |
|:------------|-----------:|----------:|---------:|-----------------:|
| local       |       1029 |     681µs |    680µs |    1,471,227,424 |
| v0.4.0      |       1034 |     678µs |    677µs |    1,476,773,664 |

### ArrayViewIsNullNonNullable

Use ArrowArrayViewIsNull() to check for nulls while consuming an int32
array that does not contain a validity buffer.

[View
Source](https://github.com/paleolimbot/arrow-nanoarrow/blob/benchmark-ipc/dev/benchmarks/c/array_benchmark.cc#L143-L172)

| preset_name | iterations | real_time | cpu_time | items_per_second |
|:------------|-----------:|----------:|---------:|-----------------:|
| local       |        746 |     941µs |    940µs |    1,063,268,768 |
| v0.4.0      |        735 |     953µs |    951µs |    1,051,845,237 |

### ArrayViewIsNull

Use ArrowArrayViewIsNull() to check for nulls while consuming an int32
array that contains 20% nulls.

[View
Source](https://github.com/paleolimbot/arrow-nanoarrow/blob/benchmark-ipc/dev/benchmarks/c/array_benchmark.cc#L176-L215)

| preset_name | iterations | real_time | cpu_time | items_per_second |
|:------------|-----------:|----------:|---------:|-----------------:|
| local       |        589 |    1.19ms |   1.19ms |      838,600,093 |
| v0.4.0      |        579 |     1.2ms |    1.2ms |      835,530,389 |

### ArrayViewGetString

Use ArrowArrayViewGetStringUnsafe() to consume a string array.

[View
Source](https://github.com/paleolimbot/arrow-nanoarrow/blob/benchmark-ipc/dev/benchmarks/c/array_benchmark.cc#L218-L249)

| preset_name | iterations | real_time | cpu_time | items_per_second |
|:------------|-----------:|----------:|---------:|-----------------:|
| local       |        559 |    1.26ms |   1.25ms |      797,174,096 |
| v0.4.0      |        558 |    1.25ms |   1.25ms |      799,731,703 |

## IPC Reader Benchmarks

Benchmarks for the ArrowArrayStream IPC reader.
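
For orientation, here is a rough sketch of the kind of consumer loop these
benchmarks time: pull arrays from an ArrowArrayStream until `get_next()`
signals the end of the stream by leaving `release` set to NULL. The struct
layouts follow the Arrow C Data and C Stream interfaces and are restated
only so the sketch compiles on its own (nanoarrow.h normally provides
them). `count_batches()` and the toy stream are hypothetical stand-ins; they
are not the benchmark code and not the nanoarrow IPC reader.

```c
#include <errno.h>
#include <stdint.h>
#include <stdlib.h>

struct ArrowSchema; /* full definition not needed for this sketch */

struct ArrowArray {
  int64_t length;
  int64_t null_count;
  int64_t offset;
  int64_t n_buffers;
  int64_t n_children;
  const void** buffers;
  struct ArrowArray** children;
  struct ArrowArray* dictionary;
  void (*release)(struct ArrowArray*);
  void* private_data;
};

struct ArrowArrayStream {
  int (*get_schema)(struct ArrowArrayStream*, struct ArrowSchema* out);
  int (*get_next)(struct ArrowArrayStream*, struct ArrowArray* out);
  const char* (*get_last_error)(struct ArrowArrayStream*);
  void (*release)(struct ArrowArrayStream*);
  void* private_data;
};

/* Consumer loop: count and release every batch. Returns 0 on success or the
   errno-style code reported by the stream. */
int count_batches(struct ArrowArrayStream* stream, int64_t* n_batches) {
  *n_batches = 0;
  for (;;) {
    struct ArrowArray array;
    int code = stream->get_next(stream, &array);
    if (code != 0) return code;
    if (array.release == NULL) return 0; /* end of stream */
    (*n_batches)++;
    array.release(&array);
  }
}

/* Toy stream (assumption for illustration): private_data holds the number of
   batches left; each batch is shaped like a 5-element null-type array. */
static void toy_array_release(struct ArrowArray* array) { array->release = NULL; }

static int toy_get_schema(struct ArrowArrayStream* stream, struct ArrowSchema* out) {
  (void)stream;
  (void)out;
  return ENOTSUP; /* schema export is left out of this sketch */
}

static int toy_get_next(struct ArrowArrayStream* stream, struct ArrowArray* out) {
  int64_t* remaining = (int64_t*)stream->private_data;
  struct ArrowArray empty = {0};
  *out = empty; /* release stays NULL once the stream is exhausted */
  if (*remaining > 0) {
    (*remaining)--;
    out->length = 5;
    out->null_count = 5;
    out->release = &toy_array_release;
  }
  return 0;
}

static const char* toy_get_last_error(struct ArrowArrayStream* stream) {
  (void)stream;
  return NULL;
}

static void toy_stream_release(struct ArrowArrayStream* stream) {
  free(stream->private_data);
  stream->release = NULL;
}

int toy_stream_init(struct ArrowArrayStream* stream, int64_t n_batches) {
  int64_t* remaining = (int64_t*)malloc(sizeof(int64_t));
  if (remaining == NULL) return ENOMEM;
  *remaining = n_batches;
  stream->get_schema = &toy_get_schema;
  stream->get_next = &toy_get_next;
  stream->get_last_error = &toy_get_last_error;
  stream->release = &toy_stream_release;
  stream->private_data = remaining;
  return 0;
}
```

A caller would initialize the stream with `toy_stream_init(&stream, 10000)`,
pass it to `count_batches()`, and finish with `stream.release(&stream)`.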

### IpcReadManyBatchesFromFile

Use the ArrowArrayStream IPC reader to read 10,000 batches with 5
elements each from a file.

[View
Source](https://github.com/paleolimbot/arrow-nanoarrow/blob/benchmark-ipc/dev/benchmarks/c/ipc_benchmark.cc#L93-L113)

| preset_name | iterations | real_time | cpu_time | items_per_second |
|:------------|-----------:|----------:|---------:|-----------------:|
| local       |        110 |    6.28ms |   6.26ms |        1,596,234 |
| v0.4.0      |        101 |    6.99ms |   6.98ms |        1,432,594 |
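
The items_per_second values above are consistent with the number of
batches read per iteration divided by cpu_time (10,000 batches in about
6.26ms is roughly 1.6 million per second). A small sketch of that
relationship, assuming this is how the column is derived; the function
name is hypothetical.

```c
/* Hypothetical helper: throughput from the items handled in one iteration
   and the CPU time of that iteration in milliseconds. */
double bench_items_per_second(double items_per_iteration, double cpu_time_ms) {
  return items_per_iteration / (cpu_time_ms / 1000.0);
}
```

The result for `bench_items_per_second(10000, 6.26)` differs slightly from
the reported 1,596,234 because the table shows a rounded cpu_time.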

### IpcReadManyBatchesFromBuffer

Use the ArrowArrayStream IPC reader to read 10,000 batches with 5
elements each from a buffer.

[View
Source](https://github.com/paleolimbot/arrow-nanoarrow/blob/benchmark-ipc/dev/benchmarks/c/ipc_benchmark.cc#L117-L147)

| preset_name | iterations | real_time | cpu_time | items_per_second |
|:------------|-----------:|----------:|---------:|-----------------:|
| local       |        127 |    5.47ms |   5.46ms |        1,831,415 |
| v0.4.0      |        114 |     6.1ms |    6.1ms |        1,640,075 |

### IpcReadManyColumnsFromFile

Use the ArrowArrayStream IPC reader to read one very wide batch from a
file.

[View
Source](https://github.com/paleolimbot/arrow-nanoarrow/blob/benchmark-ipc/dev/benchmarks/c/ipc_benchmark.cc#L151-L171)

| preset_name | iterations | real_time | cpu_time | items_per_second |
|:------------|-----------:|----------:|---------:|-----------------:|
| local       |         82 |    8.67ms |   8.66ms |        14,083.85 |
| v0.4.0      |         83 |    8.55ms |   8.55ms |        14,098.53 |

## Schema-related benchmarks

Benchmarks for producing and consuming ArrowSchema.

### SchemaInitWideStruct

Benchmark ArrowSchema creation for very wide tables.

Simulates part of the process of creating a very wide table with a
simple column type (integer).

[View
Source](https://github.com/paleolimbot/arrow-nanoarrow/blob/benchmark-ipc/dev/benchmarks/c/schema_benchmark.cc#L45-L56)

| preset_name | iterations | real_time | cpu_time | items_per_second |
|:------------|-----------:|----------:|---------:|-----------------:|
| local       |        688 |    1.02ms |   1.02ms |        9,786,782 |
| v0.4.0      |        686 |    1.02ms |   1.02ms |        9,829,446 |

### SchemaViewInitWideStruct

Benchmark ArrowSchema parsing for very wide tables.

Simulates part of the process of consuming a very wide table. Typically
the ArrowSchemaViewInit() call is made by ArrowArrayViewInit() rather
than directly, but it follows a similar pattern.

[View
Source](https://github.com/paleolimbot/arrow-nanoarrow/blob/benchmark-ipc/dev/benchmarks/c/schema_benchmark.cc#L78-L91)

| preset_name | iterations | real_time | cpu_time | items_per_second |
|:------------|-----------:|----------:|---------:|-----------------:|
| local       |       6666 |     104µs |    104µs |       96,339,782 |
| v0.4.0      |       6733 |     104µs |    104µs |       96,151,786 |


</details>
9 files changed
README.md

nanoarrow

The nanoarrow library is a set of helper functions to interpret and generate Arrow C Data Interface and Arrow C Stream Interface structures. The library is in active early development and users should update regularly from the main branch of this repository.

Whereas the current suite of Arrow implementations provides the basis for a comprehensive data analysis toolkit, this library is intended to support clients that wish to produce or interpret Arrow C Data and/or Arrow C Stream structures where linking to a higher-level Arrow binding is difficult or impossible.

Using the C library

The nanoarrow C library is intended to be copied and vendored. This can be done using CMake or by using the bundled nanoarrow.h/nanoarrow.c distribution available in the dist/ directory in this repository. Examples of both can be found in the examples/ directory in this repository.

A simple producer example:

#include "nanoarrow.h"

int make_simple_array(struct ArrowArray* array_out, struct ArrowSchema* schema_out) {
  struct ArrowError error;
  array_out->release = NULL;
  schema_out->release = NULL;

  NANOARROW_RETURN_NOT_OK(ArrowArrayInitFromType(array_out, NANOARROW_TYPE_INT32));

  NANOARROW_RETURN_NOT_OK(ArrowArrayStartAppending(array_out));
  NANOARROW_RETURN_NOT_OK(ArrowArrayAppendInt(array_out, 1));
  NANOARROW_RETURN_NOT_OK(ArrowArrayAppendInt(array_out, 2));
  NANOARROW_RETURN_NOT_OK(ArrowArrayAppendInt(array_out, 3));
  NANOARROW_RETURN_NOT_OK(ArrowArrayFinishBuildingDefault(array_out, &error));

  NANOARROW_RETURN_NOT_OK(ArrowSchemaInitFromType(schema_out, NANOARROW_TYPE_INT32));

  return NANOARROW_OK;
}

A simple consumer example:

#include <errno.h>
#include <stdio.h>

#include "nanoarrow.h"

int print_simple_array(struct ArrowArray* array, struct ArrowSchema* schema) {
  struct ArrowError error;
  struct ArrowArrayView array_view;
  NANOARROW_RETURN_NOT_OK(ArrowArrayViewInitFromSchema(&array_view, schema, &error));

  // Stop here rather than printing values from an array that is not int32
  if (array_view.storage_type != NANOARROW_TYPE_INT32) {
    printf("Array has storage that is not int32\n");
    ArrowArrayViewReset(&array_view);
    return EINVAL;
  }

  // Point the view at the array's buffers; release the view if that fails
  int result = ArrowArrayViewSetArray(&array_view, array, &error);
  if (result != NANOARROW_OK) {
    ArrowArrayViewReset(&array_view);
    return result;
  }

  for (int64_t i = 0; i < array->length; i++) {
    printf("%d\n", (int)ArrowArrayViewGetIntUnsafe(&array_view, i));
  }

  ArrowArrayViewReset(&array_view);
  return NANOARROW_OK;
}