blob: 21391ff5fbc9e0c38388bdc5023873bda684bd84 [file] [log] [blame]
.. Licensed to the Apache Software Foundation (ASF) under one
.. or more contributor license agreements. See the NOTICE file
.. distributed with this work for additional information
.. regarding copyright ownership. The ASF licenses this file
.. to you under the Apache License, Version 2.0 (the
.. "License"); you may not use this file except in compliance
.. with the License. You may obtain a copy of the License at
.. http://www.apache.org/licenses/LICENSE-2.0
.. Unless required by applicable law or agreed to in writing,
.. software distributed under the License is distributed on an
.. "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
.. KIND, either express or implied. See the License for the
.. specific language governing permissions and limitations
.. under the License.
.. highlight:: console
.. _development-cpp-compute:
============================
Developing Arrow C++ Compute
============================
This section provides information for developers of the Arrow C++ Compute module.
Row Table
=========
The row table in Arrow represents data stored in row-major format. This format
is particularly useful for scenarios involving random access to individual rows
and where all columns are frequently accessed together. It is especially
advantageous for hash-table keys and facilitates efficient operations such as
grouping and hash joins by optimizing memory access patterns and data locality.
Metadata
--------
A row table is defined by its metadata, ``RowTableMetadata``, which includes
information about its schema, alignment, and derived properties.
The schema specifies the types and order of columns. Each row in the row table
contains the data for each column in that logical order (the physical order may
vary; see :ref:`row-encoding` for details).
.. note::
Columns of nested types or large binary types are **not** supported in the
row table.
One important property derived from the schema is whether the row table is
fixed-length or varying-length. A fixed-length row table contains only
fixed-length columns, while a varying-length row table includes at least one
varying-length column. This distinction determines how data is stored and
accessed in the row table.
Each row in the row table is aligned to ``RowTableMetadata::row_alignment``
bytes. Fixed-length columns with non-power-of-2 lengths are also aligned to
``RowTableMetadata::row_alignment`` bytes. Varying-length columns are aligned to
``RowTableMetadata::string_alignment`` bytes.
Buffer Layout
-------------
Similar to most Arrow ``Array``\s, the row table consists of three buffers:
- **Null Masks Buffer**: Indicates null values for each column in each row.
- **Fixed-length Buffer**: Stores row data for fixed-length tables or offsets to
varying-length data for varying-length tables.
- **Varying-length Buffer** (Optional): Contains row data for varying-length
tables; unused for fixed-length tables.
Row Format
----------
Null Masks
~~~~~~~~~~
For each row, a contiguous sequence of bits represents whether each column in
that row is null. Each bit corresponds to a specific column, with ``1``
indicating the value is null and ``0`` indicating the value is valid. Note that
this is the opposite of how the validity bitmap works for ``Array``\s. The null
mask for a row occupies ``RowTableMetadata::null_masks_bytes_per_row`` bytes.
Fixed-length Row Data
~~~~~~~~~~~~~~~~~~~~~
In a fixed-length row table, row data is directly stored in the fixed-length
buffer. All columns in each row are stored sequentially. Notably, a ``boolean``
column is special because, in a normal Arrow ``Array``, it is stored using 1
bit, whereas in a row table, it occupies 1 byte. The varying-length buffer is
not used in this case.
For example, a row table with the schema ``(int32, boolean)`` and rows
``[[7, false], [8, true], [9, false], ...]`` is stored in the fixed-length
buffer as follows:
.. list-table::
:header-rows: 1
* - Row 0
- Row 1
- Row 2
- ...
* - ``7 0 0 0, 0 (padding)``
- ``8 0 0 0, 1 (padding)``
- ``9 0 0 0, 0 (padding)``
- ...
Offsets for Varying-length Row Data
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
In a varying-length row table, the fixed-length buffer contains offsets to the
varying-length row data, which is stored separately in the optional
varying-length buffer. The offsets are of type ``RowTableMetadata::offset_type``
(fixed as ``int64_t``) and indicate the starting position of the row data for
each row.
Varying-length Row Data
~~~~~~~~~~~~~~~~~~~~~~~
In a varying-length row table, the varying-length buffer contains the actual row
data, stored contiguously. The offsets in the fixed-length buffer point to the
starting position of each row's data.
.. _row-encoding:
Row Encoding
^^^^^^^^^^^^
A varying-length row is encoded as follows:
- Fixed-length columns are stored first.
- A sequence of offsets to each varying-length column follows. Each offset is
32-bit and indicates the **end** position within the row data of the
corresponding varying-length column.
- Varying-length columns are stored last.
For example, a row table with the schema ``(int32, string, string, int32)`` and
rows ``[[7, 'Alice', 'x', 0], [8, 'Bob', 'y', 1], [9, 'Charlotte', 'z', 2], ...]``
is stored as follows (assuming 8-byte alignment for varying-length columns):
Fixed-length buffer (row offsets):
.. list-table::
:header-rows: 1
* - Row 0
- Row 1
- Row 2
- Row 3
- ...
* - ``0 0 0 0 0 0 0 0``
- ``32 0 0 0 0 0 0 0``
- ``64 0 0 0 0 0 0 0``
- ``104 0 0 0 0 0 0 0``
- ...
Varying-length buffer (row data):
.. list-table::
:header-rows: 1
* - Row
- Fixed-length Cols
- Varying-length Offsets
- Varying-length Cols
* - 0
- ``7 0 0 0, 0 0 0 0``
- ``21 0 0 0, 25 0 0 0``
- ``Alice~~~x~~~~~~~``
* - 1
- ``8 0 0 0, 1 0 0 0``
- ``19 0 0 0, 25 0 0 0``
- ``Bob~~~~~y~~~~~~~``
* - 2
- ``9 0 0 0, 2 0 0 0``
- ``25 0 0 0, 33 0 0 0``
- ``Charlotte~~~~~~~z~~~~~~~``
* - 3
- ...
- ...
- ...