 .. Licensed to the Apache Software Foundation (ASF) under one
 .. or more contributor license agreements.  See the NOTICE file
 .. distributed with this work for additional information
 .. regarding copyright ownership.  The ASF licenses this file
 .. to you under the Apache License, Version 2.0 (the
 .. "License"); you may not use this file except in compliance
 .. with the License.  You may obtain a copy of the License at

 ..   http://www.apache.org/licenses/LICENSE-2.0

 .. Unless required by applicable law or agreed to in writing,
 .. software distributed under the License is distributed on an
 .. "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
 .. KIND, either express or implied.  See the License for the
 .. specific language governing permissions and limitations
 .. under the License.

 .. default-domain:: cpp
 .. highlight:: cpp

 ======
 Arrays
 ======

 .. seealso::
    :doc:`Array API reference <api/array>`

 The central type in Arrow is the class :class:`arrow::Array`.  An array
 represents a known-length sequence of values all having the same type.
 Internally, those values are represented by one or several buffers, the
 number and meaning of which depend on the array's data type, as documented
 in :ref:`the Arrow data layout specification <format_layout>`.

 Those buffers consist of the value data itself and an optional bitmap buffer
 that indicates which array entries are null values.  The bitmap buffer
 can be entirely omitted if the array is known to have zero null values.

 Concrete subclasses of :class:`arrow::Array` exist for each data type and
 help you access individual values of the array.

 Building an array
 =================

 Available strategies
 --------------------

 As Arrow objects are immutable, they cannot be populated directly the way,
 for example, a ``std::vector`` can.  Instead, several strategies can be used:

 * if the data already exists in memory with the right layout, you can wrap
   said memory inside :class:`arrow::Buffer` instances and then construct
   a :class:`arrow::ArrayData` describing the array;

   .. seealso:: :ref:`cpp_memory_management`

 * otherwise, the :class:`arrow::ArrayBuilder` base class and its concrete
   subclasses help you build up array data incrementally, without having to
   deal with the details of the Arrow format yourself.

 .. note:: For cases where performance isn't important, such as examples or tests,
           you may prefer to use the ``*FromJSONString`` helpers, which can create
           Arrays using a JSON text shorthand. See :ref:`fromjsonstring-helpers`.

 Using ArrayBuilder and its subclasses
 -------------------------------------

 To build an ``Int64`` Arrow array, we can use the :class:`arrow::Int64Builder`
 class. In the following example, we build an array of the range 1 to 8 where
 the element that should hold the value 4 is nulled::

    arrow::Int64Builder builder;
    builder.Append(1);
    builder.Append(2);
    builder.Append(3);
    builder.AppendNull();
    builder.Append(5);
    builder.Append(6);
    builder.Append(7);
    builder.Append(8);

    auto maybe_array = builder.Finish();
    if (!maybe_array.ok()) {
       // ... do something on array building failure
    }
    std::shared_ptr<arrow::Array> array = *maybe_array;

 The resulting Array (which can be cast to the concrete :class:`arrow::Int64Array`
 subclass if you want to access its values) then consists of two
 :class:`arrow::Buffer`\s.
 The first buffer holds the null bitmap, which consists here of a single byte with
 the bits ``1|1|1|1|0|1|1|1``. As we use `least-significant bit (LSB) numbering`_,
 the bits are read from right to left, so this indicates that the fourth entry
 in the array is null. The second buffer is simply an ``int64_t`` array
 containing all the above values. As the fourth entry is null, the value at
 that position in the buffer is undefined.
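
 The bit arithmetic behind this numbering can be sketched without Arrow itself.
 The following standalone snippet is only an illustration: the helpers
 ``IsValidBit`` and ``CountNulls`` are hypothetical names, not part of the Arrow
 API. Entry ``i`` is looked up in byte ``i / 8`` at bit position ``i % 8``,
 counted from the least-significant bit:

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper (not an Arrow API): returns true if entry `i` is
// valid (non-null) according to an LSB-numbered validity bitmap.
inline bool IsValidBit(const std::uint8_t* bitmap, std::int64_t i) {
  assert(i >= 0);
  // Entry i lives in byte i / 8, at bit position i % 8 counted from the
  // least-significant bit of that byte.
  return ((bitmap[i / 8] >> (i % 8)) & 1) != 0;
}

// Hypothetical helper: counts the null entries among the first `length`
// positions of the bitmap.
inline std::int64_t CountNulls(const std::uint8_t* bitmap, std::int64_t length) {
  std::int64_t nulls = 0;
  for (std::int64_t i = 0; i < length; ++i) {
    if (!IsValidBit(bitmap, i)) ++nulls;
  }
  return nulls;
}
```

 With the single bitmap byte from the example above (``0xF7``, that is
 ``11110111``), ``IsValidBit(bitmap, 3)`` returns ``false`` and
 ``CountNulls(bitmap, 8)`` returns ``1``.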

 Here is how you could access the concrete array's contents::

    // Cast the Array to its actual type to access its data
    auto int64_array = std::static_pointer_cast<arrow::Int64Array>(array);

    // Get the pointer to the null bitmap
    const uint8_t* null_bitmap = int64_array->null_bitmap_data();

    // Get the pointer to the actual data
    const int64_t* data = int64_array->raw_values();

    // Alternatively, given an array index, query its null bit and value directly
    int64_t index = 2;
    if (!int64_array->IsNull(index)) {
       int64_t value = int64_array->Value(index);
    }

 .. note::
    :class:`arrow::Int64Array` (respectively :class:`arrow::Int64Builder`) is
    just a ``typedef``, provided for convenience, of ``arrow::NumericArray<Int64Type>``
    (respectively ``arrow::NumericBuilder<Int64Type>``).

 .. _least-significant bit (LSB) numbering: https://en.wikipedia.org/wiki/Bit_numbering

 Performance
 -----------

 While it is possible to build an array value-by-value as in the example above,
 to attain highest performance it is recommended to use the bulk appending
 methods (usually named ``AppendValues``) in the concrete :class:`arrow::ArrayBuilder`
 subclasses.

 If you know the number of elements in advance, it is also recommended to
 presize the working area by calling the :func:`~arrow::ArrayBuilder::Resize`
 or :func:`~arrow::ArrayBuilder::Reserve` methods.

 Here is how one could rewrite the above example to take advantage of those
 APIs::

    arrow::Int64Builder builder;
    // Make place for 8 values in total
    builder.Reserve(8);
    // Bulk append the given values (with a null in 4th place as indicated by the
    // validity vector)
    std::vector<bool> validity = {true, true, true, false, true, true, true, true};
    std::vector<int64_t> values = {1, 2, 3, 0, 5, 6, 7, 8};
    builder.AppendValues(values, validity);

    auto maybe_array = builder.Finish();

 If you still must append values one by one, some concrete builder subclasses
 have methods marked "Unsafe" that assume the working area has been correctly
 presized, and offer higher performance in exchange::

    arrow::Int64Builder builder;
    // Make place for 8 values in total
    builder.Reserve(8);
    builder.UnsafeAppend(1);
    builder.UnsafeAppend(2);
    builder.UnsafeAppend(3);
    builder.UnsafeAppendNull();
    builder.UnsafeAppend(5);
    builder.UnsafeAppend(6);
    builder.UnsafeAppend(7);
    builder.UnsafeAppend(8);

    auto maybe_array = builder.Finish();

 Size Limitations and Recommendations
 ====================================

 Some array types are structurally limited to 32-bit sizes.  This is the case
 at least for list arrays (which can hold up to 2^31 elements) and for string
 and binary arrays (which can hold up to 2GB of binary data).  Some other array
 types can hold up to 2^63 elements in the C++ implementation, but other Arrow
 implementations may impose a 32-bit size limitation on those array types as well.

 For these reasons, it is recommended that huge data be chunked into subsets of
 more reasonable size.
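
 As a rough illustration of such chunking, the sketch below plans chunk lengths
 so that no chunk exceeds a chosen maximum. It uses only the standard library;
 ``PlanChunkLengths`` is a hypothetical helper, not an Arrow function, and the
 maximum chunk length is an assumption to be tuned to the data and array type:

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical helper (not an Arrow API): splits `total_length` values into
// consecutive chunk lengths, none larger than `max_chunk_length`.
inline std::vector<std::int64_t> PlanChunkLengths(std::int64_t total_length,
                                                  std::int64_t max_chunk_length) {
  assert(total_length >= 0 && max_chunk_length > 0);
  std::vector<std::int64_t> lengths;
  std::int64_t remaining = total_length;
  while (remaining > 0) {
    // Take as many values as allowed, or whatever is left for the last chunk.
    const std::int64_t n = std::min(remaining, max_chunk_length);
    lengths.push_back(n);
    remaining -= n;
  }
  return lengths;
}
```

 For example, ``PlanChunkLengths(10, 4)`` yields the lengths 4, 4 and 2; each
 length could then be used to decide how many values to append before
 finishing one array and starting the next.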

 Chunked Arrays
 ==============

 A :class:`arrow::ChunkedArray` is, like an array, a logical sequence of values;
 but unlike a simple array, a chunked array does not require the entire sequence
 to be physically contiguous in memory.  Also, the constituents of a chunked array
 need not have the same size, but they must all have the same data type.

 A chunked array is constructed by aggregating any number of arrays.  Here we'll
 build a chunked array with the same logical values as in the example above,
 but in two separate chunks::

    std::vector<std::shared_ptr<arrow::Array>> chunks;
    std::shared_ptr<arrow::Array> array;

    // Build first chunk
    arrow::Int64Builder builder;
    builder.Append(1);
    builder.Append(2);
    builder.Append(3);
    if (!builder.Finish(&array).ok()) {
       // ... do something on array building failure
    }
    chunks.push_back(std::move(array));

    // Build second chunk
    builder.Reset();
    builder.AppendNull();
    builder.Append(5);
    builder.Append(6);
    builder.Append(7);
    builder.Append(8);
    if (!builder.Finish(&array).ok()) {
       // ... do something on array building failure
    }
    chunks.push_back(std::move(array));

    auto chunked_array = std::make_shared<arrow::ChunkedArray>(std::move(chunks));

    assert(chunked_array->num_chunks() == 2);
    // Logical length in number of values
    assert(chunked_array->length() == 8);
    assert(chunked_array->null_count() == 1);
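
 To give an idea of how a logical position relates to the chunks, here is a
 standalone sketch that maps a logical index to a chunk number and an index
 within that chunk. ``ChunkLocation`` and ``ResolveIndex`` are hypothetical
 names used for illustration, not Arrow APIs, and the linear scan is just the
 simplest approach, not necessarily what Arrow does internally:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Position of a value inside a chunked sequence (hypothetical type).
struct ChunkLocation {
  int chunk_index;              // which chunk holds the value, -1 if none
  std::int64_t index_in_chunk;  // position of the value inside that chunk
};

// Hypothetical helper: maps a logical index to (chunk, index in chunk),
// given the length of every chunk. Returns {-1, -1} when out of bounds.
inline ChunkLocation ResolveIndex(const std::vector<std::int64_t>& chunk_lengths,
                                  std::int64_t logical_index) {
  assert(logical_index >= 0);
  for (std::size_t c = 0; c < chunk_lengths.size(); ++c) {
    if (logical_index < chunk_lengths[c]) {
      return {static_cast<int>(c), logical_index};
    }
    // Skip this chunk and keep searching in the following ones.
    logical_index -= chunk_lengths[c];
  }
  return {-1, -1};
}
```

 With chunk lengths ``{3, 5}`` as in the example above, logical index 3 (the
 null entry) resolves to chunk 1, index 0.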

 Slicing
 =======

 As with physical memory buffers, it is possible to make zero-copy slices
 of arrays and chunked arrays, to obtain an array or chunked array referring
 to some logical subsequence of the data.  This is done by calling the
 :func:`arrow::Array::Slice` and :func:`arrow::ChunkedArray::Slice` methods,
 respectively.
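
 The idea of a zero-copy slice can be modelled with a view that shares its
 storage and only adjusts an offset and a length. The sketch below is a
 deliberately simplified model using only the standard library; ``Int64View``
 is a hypothetical type and does not reflect Arrow's actual classes or the
 exact behavior of their ``Slice`` methods (for instance, out-of-range
 arguments are simply rejected by an assertion here):

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <memory>
#include <vector>

// Simplified model of a sliceable array view; NOT Arrow's class layout.
struct Int64View {
  std::shared_ptr<const std::vector<std::int64_t>> storage;  // shared, never copied
  std::int64_t offset;  // first logical value inside `storage`
  std::int64_t length;  // number of logical values in this view

  // Reads the i-th logical value of this view.
  std::int64_t Value(std::int64_t i) const {
    assert(i >= 0 && i < length);
    return (*storage)[static_cast<std::size_t>(offset + i)];
  }

  // Returns a view on [slice_offset, slice_offset + slice_length) of this
  // view. Only the offset and length change; the storage is shared.
  Int64View Slice(std::int64_t slice_offset, std::int64_t slice_length) const {
    assert(slice_offset >= 0 && slice_length >= 0 &&
           slice_offset + slice_length <= length);
    return Int64View{storage, offset + slice_offset, slice_length};
  }
};
```

 Slicing a view of the values 1 to 8 with ``Slice(4, 3)`` gives a view of
 length 3 whose first value is 5, and both views point at the same underlying
 storage.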

 .. _fromjsonstring-helpers:

 FromJSONString Helpers
 ======================

 A set of helper functions is provided for concisely creating Arrays and Scalars
 from JSON_ text. These helpers are intended to be used in examples, tests, or
 for quick prototyping and are not intended to be used where performance matters.
 Most users will want to use the API described in :doc:`json` which provides a
 performant way to create :class:`arrow::Table` and :class:`arrow::RecordBatch`
 objects from line-separated JSON files.

 .. _JSON: https://datatracker.ietf.org/doc/html/rfc8259

 Examples for ``ArrayFromJSONString``, ``ChunkedArrayFromJSONString``, and
 ``DictArrayFromJSONString`` are shown below:

 .. literalinclude:: ../../../cpp/examples/arrow/from_json_string_example.cc
    :language: cpp
    :start-after: arrow::Status RunExample() {
    :end-before: return arrow::Status::OK();
    :dedent: 2

 Please see the :ref:`FromJSONString API listing <api-array-from-json-string>` for
 the complete set of helpers.