| .. Licensed to the Apache Software Foundation (ASF) under one |
| .. or more contributor license agreements. See the NOTICE file |
| .. distributed with this work for additional information |
| .. regarding copyright ownership. The ASF licenses this file |
| .. to you under the Apache License, Version 2.0 (the |
| .. "License"); you may not use this file except in compliance |
| .. with the License. You may obtain a copy of the License at |
| |
| .. http://www.apache.org/licenses/LICENSE-2.0 |
| |
| .. Unless required by applicable law or agreed to in writing, |
| .. software distributed under the License is distributed on an |
| .. "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY |
| .. KIND, either express or implied. See the License for the |
| .. specific language governing permissions and limitations |
| .. under the License. |
| |
| ************ |
| Introduction |
| ************ |
| |
| Apache Arrow was born from the need for a set of standards around |
| tabular data representation and interchange between systems. |
| The adoption of these standards reduces computing costs of data |
| serialization/deserialization and implementation costs across |
| systems implemented in different programming languages. |
| |
| The Apache Arrow specification can be implemented in any programming |
| language but official implementations for many languages are available. |
| An implementation consists of format definitions using the constructs |
| offered by the language and common in-memory data processing algorithms |
| (e.g. slicing and concatenating). Users can extend and use the utilities |
| provided by the Apache Arrow implementation in their programming |
| language of choice. Some implementations are further ahead and feature a |
| vast set of algorithms for in-memory analytical data processing. More detail |
| about how implementations differ can be found on the :doc:`../status` page. |
| |
| Apart from this initial vision, Arrow has grown to also develop a |
| multi-language collection of libraries for solving problems related to |
| in-memory analytical data processing. This covers topics like: |
| |
| * Zero-copy shared memory and RPC-based data movement |
| * Reading and writing file formats (like CSV, `Apache ORC`_, and `Apache Parquet`_) |
| * In-memory analytics and query processing |
| |
| .. _Apache ORC: https://orc.apache.org/ |
| .. _Apache Parquet: https://parquet.apache.org/ |
| |
| Arrow Columnar Format |
| ===================== |
| |
| Apache Arrow focuses on tabular data. For an example, let's consider |
| we have data that can be organized into a table: |
| |
| .. figure:: ./images/columnar-diagram_1.svg |
| :scale: 70% |
| :alt: Diagram with tabular data of 4 rows and columns. |
| |
| Diagram of a tabular data structure. |
| |
| Tabular data can be represented in memory using a row-based format or a |
| column-based format. The row-based format stores data row-by-row, meaning the rows |
| are adjacent in the computer memory: |
| |
| .. figure:: ./images/columnar-diagram_2.svg |
| :alt: Tabular data being structured row by row in computer memory. |
| |
| Tabular data being saved in memory row by row. |
| |
| In a columnar format, the data is organized column-by-column instead. |
| This organization makes analytical operations like filtering, grouping, |
| aggregations and others, more efficient thanks to memory locality. |
| When processing the data, the memory locations accessed by the CPU tend |
| to be near one another. By keeping the data contiguous in memory, it also |
| enables vectorization of the computations. Most modern CPUs have |
| `SIMD instructions`_ (a single instruction that operates on multiple values at |
| once) enabling parallel processing and execution of operations on vector data |
| using a single CPU instruction. |
| |
| .. _SIMD instructions: https://en.wikipedia.org/wiki/Single_instruction,_multiple_data |
| |
| Apache Arrow is solving this exact problem. It is the specification that |
| uses the columnar layout. |
| |
| .. figure:: ./images/columnar-diagram_3.svg |
| :alt: Tabular data being structured column by column in computer memory. |
| |
| The same tabular data being saved in memory column by column. |
| |
| Each column is called an **Array** in Arrow terminology. Arrays can be of |
| different data types and the way their values are stored in memory varies among |
| the data types. The specification of how these values are arranged in memory is |
| what we call a **physical memory layout**. One contiguous region of memory that |
| stores data for arrays is called a **Buffer**. An array consists of one or more |
| buffers. |
| |
| Next sections give an introduction to Arrow Columnar Format explaining the |
| different physical layouts. The full specification of the format can be found |
| at :ref:`format_columnar`. |
| |
| Support for Null Values |
| ======================= |
| |
| Arrow supports missing values or "nulls" for all data types: any value |
| in an array may be semantically null, whether primitive or nested data type. |
| |
| In Arrow, a dedicated buffer, known as the validity (or "null") bitmap, |
| is used alongside the data indicating whether each value in the array is |
| null or not: a value of 1 means that the value is not-null ("valid"), whereas |
| a value of 0 indicates that the value is null. |
| |
| This validity bitmap is optional: if there are no missing values in |
| the array the buffer does not need to be allocated (as in the example |
| column 1 in the diagram below). |
| |
| .. note:: |
| |
| We read validity bitmaps right-to-left within a group of 8 bits due to |
| `least-significant bit numbering <https://en.wikipedia.org/wiki/Bit_numbering>`_ |
| being used. |
| |
| This is also how we have represented the validity bitmaps in the |
| diagrams included in this document. |
| |
| Primitive Layouts |
| ================= |
| |
| Fixed Size Primitive Layout |
| --------------------------- |
| |
| A primitive column represents an array of values where each value |
| has the same physical size measured in bytes. Data types that use the |
| fixed size primitive layout are, for example, signed and unsigned |
| integer data types, floating point numbers, boolean, decimal and temporal |
| data types. |
| |
| .. figure:: ./images/primitive-diagram.svg |
| :alt: Diagram is showing the difference between the primitive data |
| type presented in a Table and the data actually stored in |
| computer memory. |
| |
| Physical layout diagram for primitive data types. |
| |
| .. note:: |
| The boolean data type is represented with a primitive layout where the |
| values are encoded in bits instead of bytes. That means the physical |
| layout includes a values bitmap buffer and possibly a validity bitmap |
| buffer. |
| |
| .. figure:: ./images/bool-diagram.svg |
| :alt: Diagram is showing the difference between the boolean data |
| type presented in a Table and the data actually stored in |
| computer memory. |
| |
| Physical layout diagram for boolean data type. |
| |
| .. note:: |
| Arrow also has a concept of Null data type where all values are null. In |
| this case no buffers are allocated. |
| |
| Variable length binary and string |
| --------------------------------- |
| |
| In contrast to the fixed size primitive layout, the variable length layout |
| allows representing an array where each element can have a variable size |
| in bytes. This layout is used for binary and string data. |
| |
| The bytes of all elements in a binary or string column are stored together |
| consecutively in a single buffer or region of memory. To know where each element |
| of the column starts and ends, the physical layout also includes integer offsets. |
| The offsets buffer is always one element longer than the array. |
| The last two offsets define the start and the end of the last |
| binary/string element. |
| |
| Binary and string data types share the same physical layout. The only |
| difference between them is that a string-typed array is assumed to contain |
| valid UTF-8 string data. |
| |
| The difference between binary/string and large binary/string is in the offset |
| data type. In the first case that is int32 and in the second it is int64. |
| |
| The limitation of data types using 32 bit offsets is that they have a maximum size of |
| 2GB per array. One can still use the non-large variants for bigger data, but |
| then multiple chunks are needed. |
| |
| .. figure:: ./images/var-string-diagram.svg |
| :alt: Diagram is showing the difference between the variable length |
| string data type presented in a Table and the data actually |
| stored in computer memory. |
| |
| Physical layout diagram for variable length string data types. |
| |
| Variable length binary and string view |
| -------------------------------------- |
| |
| .. _UmbraDB: https://umbra-db.com/ |
| .. _DuckDB: https://duckdb.com |
| .. _Velox: https://velox-lib.io/ |
| |
| This layout is an alternative for the variable length binary layout and is adapted |
| from TU Munich's `UmbraDB`_ and is similar to the string layout used in `DuckDB`_ and |
| `Velox`_ (and sometimes also called "German strings"). |
| |
| The main difference to the classical binary and string layout is the views buffer. |
| It includes the length of the string, and then either its characters appearing |
| inline (for small strings) or only the first 4 bytes of the string and an offset into |
| one of the potentially several data buffers. Because it uses an offset and length to refer |
| to the data buffer, the bytes of all elements do not need to be stored |
| consecutively in a single buffer. This enables out of order writing of |
| variable length elements into the array. |
| |
| These properties are important for efficient string processing. The prefix |
| enables a profitable fast path for string comparisons, which are frequently |
| determined within the first four bytes. Selecting elements is a simple "gather" |
| operation on the fixed-width views buffer and does not need to rewrite the |
| values buffers. |
| |
| .. figure:: ./images/var-string-view-diagram.svg |
| :alt: Diagram is showing the difference between the variable length |
| string view data type presented in a Table and the data actually |
| stored in computer memory. |
| |
| Physical layout diagram for variable length string view data type. |
| |
| Nested Layouts |
| ============== |
| |
| Nested data types introduce the concept of parent and child arrays. They express |
| relationships between physical value arrays in a nested data type structure. |
| |
| Nested data types depend on one or more other child data types. For instance, List |
| is a nested data type (parent) that has one child (the data type of the values in |
| the list). |
| |
| List |
| ---- |
| |
| The list data type enables representing an array where each element is a sequence |
| of elements of the same data type. The layout is similar to variable-size binary |
| or string layout as it has an offsets buffer to define where the sequence of values |
| for each element starts and ends, with all the values being stored consecutively |
| in a values child array. |
| |
| The offsets in the list data type are int32 while in the large list the offsets |
| are int64. |
| |
| .. figure:: ./images/var-list-diagram.svg |
| :alt: Diagram is showing the difference between the variable size |
| list data type presented in a Table and the data actually |
| stored in computer memory. |
| |
| Physical layout diagram for variable size list data type. |
| |
| Fixed Size List |
| --------------- |
| |
| Fixed-size list is a special case of variable-size list where each column slot |
| contains a fixed size sequence meaning all lists are the same size and so the |
| offset buffer is no longer needed. |
| |
| .. figure:: ./images/fixed-list-diagram.svg |
| :alt: Diagram is showing the difference between the fixed size list data |
| type presented in a Table and the data actually stored in computer |
| memory. |
| |
| Physical layout diagram for fixed size list data type. |
| |
| List View |
| --------- |
| |
| In contrast to the list type, list view type also has a size buffer together |
| with an offset buffer. The offsets continue to indicate the start of each |
| element but size is now saved in a separate size buffer. This allows |
| out-of-order offsets as the sizes aren't derived from the consecutive |
| offsets anymore. |
| |
| .. figure:: ./images/var-list-view-diagram.svg |
| :alt: Diagram is showing the difference between the variable size list view |
| data type presented in a Table and the data actually stored in |
| computer memory. |
| |
| Physical layout diagram for variable size list view data type. |
| |
| Struct |
| ------ |
| |
| A struct is a nested data type parameterized by an ordered sequence of fields |
| (a data type and a name). |
| |
| * There is one child array for each field |
| * Child arrays are independent and need not be adjacent to each other in |
| memory. They only need to have the same length. |
| |
| One can think of an individual struct field as a key-value pair where the |
| key is the field name and the child array its values. The field (key) is |
| saved in the schema and the values of a specific field (key) are saved in |
| the child array. |
| |
| .. figure:: ./images/struct-diagram.svg |
| :alt: Diagram is showing the difference between the struct data type |
| presented in a Table and the data actually stored in computer |
| memory. |
| |
| Physical layout diagram for struct data type. |
| |
| Map |
| --- |
| |
| The Map data type represents nested data where each value is a variable number of |
| key-value pairs. Its physical representation is the same as a list of ``{key, value}`` |
| structs. |
| |
| The difference between the struct and map data types is that a struct holds the key |
| in the schema, requiring keys to be strings, and the values are stored in the |
| child arrays, |
| one for each field. There can be multiple keys and therefore multiple child arrays. |
| The map, on the other hand, has one child array holding all the different keys (that |
| thus all need to be of the same data type, but not necessarily strings) and a second |
| child array holding all the values. The values need to be of the same data type; however, |
| the data type doesn't have to match that of the keys. |
| |
| Also, the map stores the struct in a list and needs an offset as the list is |
| variable shape. |
| |
| .. figure:: ./images/map-diagram.svg |
| :alt: Diagram is showing the difference between the map data type |
| presented in a Table and the data actually stored in computer |
| memory. |
| |
| Physical layout diagram for map data type. |
| |
| Union |
| ----- |
| |
| The union is a nested data type where each slot in the union has a value with a data type |
| chosen from a subset of possible Arrow data types. That means that a union array represents |
| a mixed-type array. Unlike other data types, unions do not have their own validity bitmap |
| and the nullness is determined by the child arrays. |
| |
| Arrow defines two distinct union data types, "dense" and "sparse". |
| |
| Dense Union |
| ^^^^^^^^^^^ |
| |
| A Dense Union has one child array for each data type present in the mixed-type array and |
| two buffers of its own: |
| |
| * **Types buffer:** holds data type id for each slot of the array. Data type id is |
| frequently the index of the child array; however, the relationship between data type |
| ID and the child index is a parameter of the data type. |
| * **Offsets buffer:** holds relative offset into the respective child array for each |
| array slot. |
| |
| .. figure:: ./images/dense-union-diagram.svg |
| :alt: Diagram is showing the difference between the dense union data type |
| presented in a Table and the data actually stored in computer |
| memory. |
| |
| Physical layout diagram for dense union data type. |
| |
| Sparse union |
| ^^^^^^^^^^^^ |
| |
| A sparse union has the same structure as a dense union, with the omission of the offsets |
| buffer. In this case, the child arrays are each equal in length to the length of the union. |
| |
| |
| .. figure:: ./images/sparse-union-diagram.svg |
| :alt: Diagram is showing the difference between the sparse union data type |
| presented in a Table and the data actually stored in computer |
| memory. |
| |
| Physical layout diagram for sparse union data type. |
| |
| Dictionary Encoded Layout |
| ========================= |
| |
| Dictionary encoding can be effective when one has data with many repeated values. |
| The values are represented by integers referencing a dictionary usually consisting of |
| unique values. |
| |
| .. figure:: ./images/dictionary-diagram.svg |
| :alt: Diagram is showing the difference between the dictionary data type |
| presented in a Table and the data actually stored in computer |
| memory. |
| |
| Physical layout diagram for dictionary data type. |
| |
| Run-End Encoded Layout |
| ====================== |
| |
| Run-end encoding is well-suited for representing data containing sequences of the |
| same value. These sequences are called runs. A run-end encoded array has no buffers |
| of its own, but has two child arrays: |
| |
| * **Run ends array:** holds the index in the array where each run ends. The number |
| of run ends is the same as the length of its parent array. |
| * **Values array:** the actual values without repetitions (together with null values). |
| |
| Note that nulls of the parent array are strictly represented in the values array. |
| |
| .. figure:: ./images/ree-diagram.svg |
| :alt: Diagram is showing the difference between the run-end encoded data |
| type presented in a Table and the data actually stored in computer |
| memory. |
| |
| Physical layout diagram for run-end encoded data type. |
| |
| .. seealso:: |
| Table of all Arrow :ref:`data_types`. |
| |
| Overview of Arrow Terminology |
| ============================= |
| |
| **Physical layout** |
| A specification for how to represent values of an array in memory. |
| |
| **Buffer** |
| A contiguous region of memory with a given length in bytes. Buffers are used to store data |
| for arrays. Sometimes we use the notion of number of elements in a buffer which can only be |
| used if we know the data type of the array that wraps this specific buffer. |
| |
| **Array** |
| A contiguous, one-dimensional sequence of values with known length where all values have the |
| same data type. An array consists of zero or more buffers. |
| |
| **Chunked Array** |
| A discontiguous, one-dimensional sequence of values with known length where all values have |
| the same data type. Consists of zero or more arrays, the “chunks”. |
| |
| .. note:: |
| Chunked Array is a concept specific to certain implementations such as Arrow C++ and PyArrow. |
| |
| **RecordBatch** |
| A contiguous, two-dimensional data structure which consists of an ordered collection of arrays |
| of the same length. |
| |
| **Schema** |
| An ordered collection of fields that communicates all the data types of an object |
| like a RecordBatch or Table. Schemas can contain optional key/value metadata. |
| |
| **Field** |
| A Field includes a field name, a data type, a nullability flag and |
| optional key-value metadata for a specific column in a RecordBatch. |
| |
| **Table** |
| A discontiguous, two-dimensional chunk of data consisting of an ordered collection of Chunked |
| Arrays. All Chunked Arrays have the same length, but may have different types. Different columns |
| may be chunked differently. |
| |
| .. note:: |
| Table is a concept specific to certain implementations such as Arrow C++ and PyArrow. In Java |
| implementation, for example, a Table is not a collection of Chunked Arrays but a collection of |
| RecordBatches. |
| |
| .. image:: ../cpp/tables-versus-record-batches.svg |
| :alt: A graphical representation of an Arrow Table and a |
| Record Batch, with structure as described in text above. |
| |
| .. seealso:: |
| The :ref:`glossary` for more terms. |
| |
| Extension Types |
| =============== |
| |
| In case the system or application needs to extend standard Arrow data types with |
| custom semantics, this is enabled by defining extension types. |
| |
| Examples of an extension type are :ref:`uuid_extension` or |
| :ref:`fixed_shape_tensor_extension` extension type. |
| |
| Extension types can be defined by annotating any of the built-in Arrow data types |
| (the “storage type”) with a custom type name and optional serialized representation |
| (``'ARROW:extension:name'`` and ``'ARROW:extension:metadata'`` keys in the Field |
| metadata structure). |
| |
| .. seealso:: |
| The :ref:`format_metadata_extension_types` documentation. |
| |
| Canonical Extension Types |
| ------------------------- |
| |
| It is beneficial to share the definitions of well-known extension types so as to |
| improve interoperability between different systems integrating Arrow columnar data. |
| For this reason canonical extension types are defined in Arrow itself. |
| |
| .. seealso:: |
| The :ref:`format_canonical_extensions` documentation. |
| |
| Community Extension Types |
| ------------------------- |
| These are Arrow extension types that have been established as standards within specific |
| domain areas. |
| |
| Example: |
| |
| * `GeoArrow`_: A collection of Arrow extension types for representing vector geometries |
| |
| .. _GeoArrow: https://geoarrow.org |
| |
| Sharing Arrow data |
| ================== |
| |
| Arrow memory layout is meant to be a universal standard for representing tabular data in memory, |
| not tied to a specific implementation. The Arrow standard defines two protocols for |
| well-defined and unambiguous communication of Arrow data between applications: |
| |
| * Protocol to share Arrow data between processes or over the network is called :ref:`format-ipc`. |
| The specification for sharing data is called IPC message format which defines how Arrow |
| array or record batch buffers are stacked together to be serialized and deserialized. |
| |
| * To share Arrow data in the same process :ref:`c-data-interface` is used, meant for sharing |
| the same buffer zero-copy in memory between different libraries within the same process. |