| .. Licensed to the Apache Software Foundation (ASF) under one |
| .. or more contributor license agreements. See the NOTICE file |
| .. distributed with this work for additional information |
| .. regarding copyright ownership. The ASF licenses this file |
| .. to you under the Apache License, Version 2.0 (the |
| .. "License"); you may not use this file except in compliance |
| .. with the License. You may obtain a copy of the License at |
| |
| .. http://www.apache.org/licenses/LICENSE-2.0 |
| |
| .. Unless required by applicable law or agreed to in writing, |
| .. software distributed under the License is distributed on an |
| .. "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY |
| .. KIND, either express or implied. See the License for the |
| .. specific language governing permissions and limitations |
| .. under the License. |
| |
| .. _format_integration_testing: |
| |
| Integration Testing |
| =================== |
| |
To ensure that Arrow implementations are interoperable with each other,
the Arrow project includes cross-language integration tests which are
regularly run as Continuous Integration tasks.
| |
| The integration tests exercise compliance with several Arrow specifications: |
| the :ref:`IPC format <format-ipc>`, the :ref:`Flight RPC <flight-rpc>` protocol, |
| and the :ref:`C Data Interface <c-data-interface>`. |
| |
| Strategy |
| -------- |
| |
| Our strategy for integration testing between Arrow implementations is: |
| |
| * Test datasets are specified in a custom human-readable, |
| :ref:`JSON-based format <format_json_integration>` designed exclusively |
| for Arrow's integration tests. |
| |
| * The JSON files are generated by the integration test harness. Different |
| files are used to represent different data types and features, such as |
| numerics, lists, dictionary encoding, etc. This makes it easier to pinpoint |
| incompatibilities than if all data types were represented in a single file. |
| |
| * Each implementation provides entry points capable of converting |
| between the JSON and the Arrow in-memory representation, and of exposing |
| Arrow in-memory data using the desired format. |
| |
| * Each format (whether Arrow IPC, Flight or the C Data Interface) is tested for |
| all supported pairs of (producer, consumer) implementations. The producer |
| typically reads a JSON file, converts it to in-memory Arrow data, and exposes |
this data using the format under test. The consumer reads the data in that
format and converts it back to Arrow in-memory data; it also reads
| the same JSON file as the producer, and validates that both datasets are |
| identical. |
| |
| Example: IPC format |
| ~~~~~~~~~~~~~~~~~~~ |
| |
| Let's say we are testing Arrow C++ as a producer and Arrow Java as a consumer |
| of the Arrow IPC format. Testing a JSON file would go as follows: |
| |
| #. A C++ executable reads the JSON file, converts it into Arrow in-memory data |
| and writes an Arrow IPC file (the file paths are typically given on the command |
| line). |
| |
| #. A Java executable reads the JSON file, converts it into Arrow in-memory data; |
| it also reads the Arrow IPC file generated by C++. Finally, it validates that |
| both Arrow in-memory datasets are equal. |
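
For illustration, the two entry points driven by the test harness look roughly
like the following (the executable name, class name and flags shown here are an
approximation and may differ between Arrow versions):

.. code-block:: shell

   # C++ producer: convert the JSON file into an Arrow IPC file
   arrow-json-integration-test --integration --mode=JSON_TO_ARROW \
       --json=generated_primitive.json --arrow=generated_primitive.arrow_file

   # Java consumer: validate the IPC file written by C++ against the JSON file
   java -cp $ARROW_JAVA_INTEGRATION_JAR org.apache.arrow.tools.Integration \
       -c VALIDATE -j generated_primitive.json -a generated_primitive.arrow_file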
| |
| Example: C Data Interface |
| ~~~~~~~~~~~~~~~~~~~~~~~~~ |
| |
| Now, let's say we are testing Arrow Go as a producer and Arrow C# as a consumer |
| of the Arrow C Data Interface. |
| |
| #. The integration testing harness allocates a C |
| :ref:`ArrowArray <c-data-interface-struct-defs>` structure on the heap. |
| |
| #. A Go in-process entrypoint (for example a C-compatible function call) |
| reads a JSON file and exports one of its :term:`record batches <record batch>` |
| into the ``ArrowArray`` structure. |
| |
| #. A C# in-process entrypoint reads the same JSON file, converts the |
| same record batch into Arrow in-memory data; it also imports the |
| record batch exported by Arrow Go in the ``ArrowArray`` structure. |
| It validates that both record batches are equal, and then releases the |
| imported record batch. |
| |
| #. Depending on the implementation languages' abilities, the integration |
| testing harness may assert that memory consumption remained identical |
| (i.e., that the exported record batch didn't leak). |
| |
| #. At the end, the integration testing harness deallocates the ``ArrowArray`` |
| structure. |
| |
| .. _running_integration_tests: |
| |
| Running integration tests |
| ------------------------- |
| |
| The integration test data generator and runner are implemented inside |
| the :ref:`Archery <archery>` utility. You need to install the ``integration`` |
| component of archery: |
| |
| .. code:: console |
| |
| $ pip install -e "dev/archery[integration]" |
| |
| The integration tests are run using the ``archery integration`` command. |
| |
| .. code-block:: console |
| |
| $ archery integration --help |
| |
| In order to run integration tests, you'll first need to build each component |
| you want to include. See the respective developer docs for C++, Java, etc. |
| for instructions on building those. |
| |
| Some languages may require additional build options to enable integration |
| testing. For C++, for example, you need to add ``-DARROW_BUILD_INTEGRATION=ON`` |
to your CMake invocation.
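
For example, a C++ build with the integration executables enabled (plus Flight,
if you also want to run the Flight integration tests) could be configured roughly
as follows; the exact source and build paths, and any additional options, depend
on your environment (see the C++ developer docs):

.. code-block:: shell

   # Configure and build Arrow C++ with the integration test executables
   cmake -S cpp -B cpp/build -DARROW_BUILD_INTEGRATION=ON -DARROW_FLIGHT=ON
   cmake --build cpp/build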
| |
Depending on which components you have built, you can enable them in the
Archery test run. For example, if you only have the C++ project built
| and want to run the Arrow IPC integration tests, run: |
| |
| .. code-block:: shell |
| |
| archery integration --run-ipc --with-cpp=1 |
| |
| For Java, it may look like: |
| |
| .. code-block:: shell |
| |
| VERSION=14.0.0-SNAPSHOT |
| export ARROW_JAVA_INTEGRATION_JAR=$JAVA_DIR/tools/target/arrow-tools-$VERSION-jar-with-dependencies.jar |
| archery integration --run-ipc --with-cpp=1 --with-java=1 |
| |
| To run all tests, including Flight and C Data Interface integration tests, do: |
| |
| .. code-block:: shell |
| |
| archery integration --with-all --run-flight --run-ipc --run-c-data |
| |
| Note that we run these tests in continuous integration, and the CI job uses |
| Docker Compose. You may also run the Docker Compose job locally, or at least |
| refer to it if you have questions about how to build other languages or enable |
| certain tests. |
| |
| See :ref:`docker-builds` for more information about the project's |
| ``docker compose`` configuration. |
| |
| .. _format_json_integration: |
| |
| JSON test data format |
| --------------------- |
| |
| A JSON representation of Arrow columnar data is provided for |
| cross-language integration testing purposes. |
| This representation is `not canonical <https://lists.apache.org/thread.html/6947fb7666a0f9cc27d9677d2dad0fb5990f9063b7cf3d80af5e270f%40%3Cdev.arrow.apache.org%3E>`_ |
| but it provides a human-readable way of verifying language implementations. |
| |
| See `here <https://github.com/apache/arrow/tree/main/docs/source/format/integration_json_examples>`_ |
| for some examples of this JSON data. |
| |
| .. can we check in more examples, e.g. from the generated_*.json test files? |
| |
The high-level structure of a JSON integration test file is as follows:
| |
| **Data file** :: |
| |
| { |
| "schema": /*Schema*/, |
| "batches": [ /*RecordBatch*/ ], |
| "dictionaries": [ /*DictionaryBatch*/ ], |
| } |
| |
| All files contain ``schema`` and ``batches``, while ``dictionaries`` is only |
| present if there are dictionary type fields in the schema. |
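
For illustration, a minimal data file describing a single non-nullable ``int32``
column with one two-row batch could look as follows (a hand-written sketch, not
one of the checked-in example files)::

    {
      "schema": {
        "fields": [
          {
            "name": "foo",
            "nullable": false,
            "type": {
              "name": "int",
              "bitWidth": 32,
              "isSigned": true
            },
            "children": []
          }
        ]
      },
      "batches": [
        {
          "count": 2,
          "columns": [
            {
              "name": "foo",
              "count": 2,
              "VALIDITY": [1, 1],
              "DATA": [1, 2]
            }
          ]
        }
      ]
    }

The ``Schema``, ``RecordBatch`` and ``FieldData`` structures used above are
described in the following sections.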
| |
| **Schema** :: |
| |
| { |
| "fields" : [ |
| /* Field */ |
| ], |
| "metadata" : /* Metadata */ |
| } |
| |
| **Field** :: |
| |
| { |
| "name" : "name_of_the_field", |
| "nullable" : /* boolean */, |
| "type" : /* Type */, |
| "children" : [ /* Field */ ], |
| "dictionary": { |
| "id": /* integer */, |
| "indexType": /* Type */, |
| "isOrdered": /* boolean */ |
| }, |
| "metadata" : /* Metadata */ |
| } |
| |
| The ``dictionary`` attribute is present if and only if the ``Field`` corresponds to a |
| dictionary type, and its ``id`` maps onto a column in the ``DictionaryBatch``. In this |
| case the ``type`` attribute describes the value type of the dictionary. |
| |
| For primitive types, ``children`` is an empty array. |
| |
| **Metadata** :: |
| |
| null | |
| [ { |
| "key": /* string */, |
| "value": /* string */ |
| } ] |
| |
| A key-value mapping of custom metadata. It may be omitted or null, in which case it is |
| considered equivalent to ``[]`` (no metadata). Duplicated keys are not forbidden here. |
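
For example, the following is a valid ``Metadata`` value, including a duplicated
key (an illustrative sketch)::

    [
      {"key": "k1", "value": "v1"},
      {"key": "k1", "value": "v2"},
      {"key": "origin", "value": "integration testing"}
    ]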
| |
| **Type**: :: |
| |
| { |
| "name" : "null|struct|list|largelist|listview|largelistview|fixedsizelist|union|int|floatingpoint|utf8|largeutf8|binary|largebinary|utf8view|binaryview|fixedsizebinary|bool|decimal|date|time|timestamp|interval|duration|map|runendencoded" |
| } |
| |
| A ``Type`` will have other fields as defined in |
| `Schema.fbs <https://github.com/apache/arrow/tree/main/format/Schema.fbs>`_ |
| depending on its name. |
| |
| Int: :: |
| |
| { |
| "name" : "int", |
| "bitWidth" : /* integer */, |
| "isSigned" : /* boolean */ |
| } |
| |
| FloatingPoint: :: |
| |
| { |
| "name" : "floatingpoint", |
| "precision" : "HALF|SINGLE|DOUBLE" |
| } |
| |
| FixedSizeBinary: :: |
| |
| { |
| "name" : "fixedsizebinary", |
| "byteWidth" : /* byte width */ |
| } |
| |
| Decimal: :: |
| |
| { |
| "name" : "decimal", |
| "precision" : /* integer */, |
| "scale" : /* integer */ |
| } |
| |
| Timestamp: :: |
| |
| { |
| "name" : "timestamp", |
| "unit" : "$TIME_UNIT", |
| "timezone": "$timezone" |
| } |
| |
| ``$TIME_UNIT`` is one of ``"SECOND|MILLISECOND|MICROSECOND|NANOSECOND"`` |
| |
| "timezone" is an optional string. |
| |
| Duration: :: |
| |
| { |
| "name" : "duration", |
| "unit" : "$TIME_UNIT" |
| } |
| |
| Date: :: |
| |
| { |
| "name" : "date", |
| "unit" : "DAY|MILLISECOND" |
| } |
| |
| Time: :: |
| |
| { |
| "name" : "time", |
| "unit" : "$TIME_UNIT", |
| "bitWidth": /* integer: 32 or 64 */ |
| } |
| |
| Interval: :: |
| |
| { |
| "name" : "interval", |
| "unit" : "YEAR_MONTH|DAY_TIME" |
| } |
| |
| Union: :: |
| |
| { |
| "name" : "union", |
| "mode" : "SPARSE|DENSE", |
| "typeIds" : [ /* integer */ ] |
| } |
| |
The ``typeIds`` field in ``Union`` lists the codes used to denote which member of
the union is active in each array slot. Note that, in general, these discriminants
are not identical to the index of the corresponding child array.
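
For example, a sparse union of ``int32`` and ``utf8`` whose type codes are 5 and
10 could be declared as follows (an illustrative sketch)::

    {
      "name": "union_nullable",
      "nullable": true,
      "type": {
        "name": "union",
        "mode": "SPARSE",
        "typeIds": [5, 10]
      },
      "children": [
        {
          "name": "u0",
          "nullable": true,
          "type": {
            "name": "int",
            "bitWidth": 32,
            "isSigned": true
          },
          "children": []
        },
        {
          "name": "u1",
          "nullable": true,
          "type": {
            "name": "utf8"
          },
          "children": []
        }
      ]
    }

Here, a ``TYPE_ID`` value of 10 in the data selects the second child ``u1``
(child index 1), not a hypothetical child number 10.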
| |
| List: :: |
| |
| { |
| "name": "list" |
| } |
| |
| The type that the list is a "list of" will be included in the ``Field``'s |
| "children" member, as a single ``Field`` there. For example, for a list of |
| ``int32``, :: |
| |
| { |
| "name": "list_nullable", |
| "type": { |
| "name": "list" |
| }, |
| "nullable": true, |
| "children": [ |
| { |
| "name": "item", |
| "type": { |
| "name": "int", |
| "isSigned": true, |
| "bitWidth": 32 |
| }, |
| "nullable": true, |
| "children": [] |
| } |
| ] |
| } |
| |
| FixedSizeList: :: |
| |
| { |
| "name": "fixedsizelist", |
| "listSize": /* integer */ |
| } |
| |
| This type likewise comes with a length-1 "children" array. |
| |
| Struct: :: |
| |
| { |
| "name": "struct" |
| } |
| |
| The ``Field``'s "children" contains an array of ``Fields`` with meaningful |
| names and types. |
| |
| Map: :: |
| |
| { |
| "name": "map", |
| "keysSorted": /* boolean */ |
| } |
| |
| The ``Field``'s "children" contains a single ``struct`` field, which itself |
contains two children, named "key" and "value".
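
For example, a map from non-nullable ``utf8`` keys to nullable ``int32`` values
could be declared as follows (an illustrative sketch; the child ``struct`` is
conventionally named "entries")::

    {
      "name": "map_nullable",
      "nullable": true,
      "type": {
        "name": "map",
        "keysSorted": false
      },
      "children": [
        {
          "name": "entries",
          "nullable": false,
          "type": {
            "name": "struct"
          },
          "children": [
            {
              "name": "key",
              "nullable": false,
              "type": {
                "name": "utf8"
              },
              "children": []
            },
            {
              "name": "value",
              "nullable": true,
              "type": {
                "name": "int",
                "bitWidth": 32,
                "isSigned": true
              },
              "children": []
            }
          ]
        }
      ]
    }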
| |
| Null: :: |
| |
| { |
| "name": "null" |
| } |
| |
| RunEndEncoded: :: |
| |
| { |
| "name": "runendencoded" |
| } |
| |
| The ``Field``'s "children" should be exactly two child fields. The first |
| child must be named "run_ends", be non-nullable and be either an ``int16``, |
| ``int32``, or ``int64`` type field. The second child must be named "values", |
| but can be of any type. |
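
For example, a run-end encoded field of ``utf8`` values with ``int32`` run ends
could be declared as follows (an illustrative sketch)::

    {
      "name": "ree_nullable",
      "nullable": true,
      "type": {
        "name": "runendencoded"
      },
      "children": [
        {
          "name": "run_ends",
          "nullable": false,
          "type": {
            "name": "int",
            "bitWidth": 32,
            "isSigned": true
          },
          "children": []
        },
        {
          "name": "values",
          "nullable": true,
          "type": {
            "name": "utf8"
          },
          "children": []
        }
      ]
    }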
| |
| Extension types are, as in the IPC format, represented as their underlying |
| storage type plus some dedicated field metadata to reconstruct the extension |
| type. For example, assuming a "rational" extension type backed by a |
| ``struct<numer: int32, denom: int32>`` storage, here is how a "rational" field |
| would be represented:: |
| |
| { |
| "name" : "name_of_the_field", |
| "nullable" : /* boolean */, |
| "type" : { |
| "name" : "struct" |
| }, |
| "children" : [ |
| { |
| "name": "numer", |
| "type": { |
| "name": "int", |
| "bitWidth": 32, |
| "isSigned": true |
| } |
| }, |
| { |
| "name": "denom", |
| "type": { |
| "name": "int", |
| "bitWidth": 32, |
| "isSigned": true |
| } |
| } |
| ], |
| "metadata" : [ |
| {"key": "ARROW:extension:name", "value": "rational"}, |
| {"key": "ARROW:extension:metadata", "value": "rational-serialized"} |
| ] |
| } |
| |
| **RecordBatch**:: |
| |
| { |
| "count": /* integer number of rows */, |
| "columns": [ /* FieldData */ ] |
| } |
| |
| **DictionaryBatch**:: |
| |
| { |
| "id": /* integer */, |
| "data": [ /* RecordBatch */ ] |
| } |
| |
| **FieldData**:: |
| |
| { |
| "name": "field_name", |
| "count" "field_length", |
| "$BUFFER_TYPE": /* BufferData */ |
| ... |
| "$BUFFER_TYPE": /* BufferData */ |
| "children": [ /* FieldData */ ] |
| } |
| |
| The "name" member of a ``Field`` in the ``Schema`` corresponds to the "name" |
| of a ``FieldData`` contained in the "columns" of a ``RecordBatch``. |
| For nested types (list, struct, etc.), ``Field``'s "children" each have a |
| "name" that corresponds to the "name" of a ``FieldData`` inside the |
| "children" of that ``FieldData``. |
| For ``FieldData`` inside of a ``DictionaryBatch``, the "name" field does not |
| correspond to anything. |
| |
| Here ``$BUFFER_TYPE`` is one of ``VALIDITY``, ``OFFSET`` (for |
| variable-length types, such as strings and lists), ``TYPE_ID`` (for unions), |
| or ``DATA``. |
| |
| ``BufferData`` is encoded based on the type of buffer: |
| |
* ``VALIDITY``: a JSON array of 1 (valid) and 0 (null). Data for a non-nullable
``Field`` still has a ``VALIDITY`` array, in which case all its values are 1.
| * ``OFFSET``: a JSON array of integers for 32-bit offsets or |
| string-formatted integers for 64-bit offsets. |
| * ``TYPE_ID``: a JSON array of integers. |
| * ``DATA``: a JSON array of encoded values. |
| * ``VARIADIC_DATA_BUFFERS``: a JSON array of data buffers represented as |
| hex encoded strings. |
| * ``VIEWS``: a JSON array of encoded views, which are JSON objects with: |
| |
| * ``SIZE``: an integer indicating the size of the view, |
| * ``INLINED``: an encoded value (this field will be present if ``SIZE`` |
| is smaller than 12, otherwise the next three fields will be present), |
| * ``PREFIX_HEX``: the first four bytes of the view encoded as hex, |
| * ``BUFFER_INDEX``: the index in ``VARIADIC_DATA_BUFFERS`` of the buffer |
| viewed, |
| * ``OFFSET``: the offset in the buffer viewed. |
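
For example, the ``VARIADIC_DATA_BUFFERS`` and ``VIEWS`` members of a string view
column holding the two values ``"short"`` and ``"the quick brown fox!"`` could
look roughly like this (an illustrative sketch)::

    "VARIADIC_DATA_BUFFERS": ["74686520717569636B2062726F776E20666F7821"],
    "VIEWS": [
      {"SIZE": 5, "INLINED": "short"},
      {"SIZE": 20, "PREFIX_HEX": "74686520", "BUFFER_INDEX": 0, "OFFSET": 0}
    ]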
| |
| The value encoding for ``DATA`` is different depending on the logical |
| type: |
| |
| * For boolean type: an array of 1 (true) and 0 (false). |
| * For integer-based types (including timestamps): an array of JSON numbers. |
| * For 64-bit integers: an array of integers formatted as JSON strings, |
| so as to avoid loss of precision. |
| * For floating point types: an array of JSON numbers. Values are limited |
| to 3 decimal places to avoid loss of precision. |
| * For binary types, an array of uppercase hex-encoded strings, so as |
| to represent arbitrary binary data. |
| * For UTF-8 string types, an array of JSON strings. |
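
For example, the ``FieldData`` of a nullable ``int64`` column holding the values
-1 and 9223372036854775807 could read as follows (an illustrative sketch, showing
the string-formatted 64-bit integers)::

    {
      "name": "int64_nullable",
      "count": 2,
      "VALIDITY": [1, 1],
      "DATA": ["-1", "9223372036854775807"]
    }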
| |
| For "list" and "largelist" types, ``BufferData`` has ``VALIDITY`` and |
| ``OFFSET``, and the rest of the data is inside "children". These child |
| ``FieldData`` contain all of the same attributes as non-child data, so in |
| the example of a list of ``int32``, the child data has ``VALIDITY`` and |
| ``DATA``. |
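
For example, the ``FieldData`` of a nullable list of ``int32`` holding the values
``[[1, 2], null, [3]]`` could read as follows (an illustrative sketch)::

    {
      "name": "list_nullable",
      "count": 3,
      "VALIDITY": [1, 0, 1],
      "OFFSET": [0, 2, 2, 3],
      "children": [
        {
          "name": "item",
          "count": 3,
          "VALIDITY": [1, 1, 1],
          "DATA": [1, 2, 3]
        }
      ]
    }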
| |
| For "fixedsizelist", there is no ``OFFSET`` member because the offsets are |
| implied by the field's "listSize". |
| |
| Note that the "count" for these child data may not match the parent "count". |
| For example, if a ``RecordBatch`` has 7 rows and contains a ``FixedSizeList`` |
| of ``listSize`` 4, then the data inside the "children" of that ``FieldData`` |
| will have count 28. |
| |
| For "null" type, ``BufferData`` does not contain any buffers. |
| |
| Archery Integration Test Cases |
| ------------------------------ |
| |
Knowing which cases the automated integration tests actually cover makes it
easier to understand what manual testing may be needed for any future Arrow
format changes. The following sections describe those cases.
| |
| There are two types of integration test cases: the ones populated on the fly |
| by the data generator in the Archery utility, and *gold* files that exist |
| in the `arrow-testing <https://github.com/apache/arrow-testing/tree/master/data/arrow-ipc-stream/integration>`_ |
| repository. |
| |
| Data Generator Tests |
| ~~~~~~~~~~~~~~~~~~~~ |
| |
The following is a high-level description of the cases which are generated and
tested using the ``archery integration`` command (see ``get_generated_json_files``
in ``datagen.py``):
| |
| * Primitive Types |
| - No Batches |
| - Various Primitive Values |
| - Batches with Zero Length |
| - String and Binary Large offset cases |
| * Null Type |
| * Trivial Null batches |
| * Decimal128 |
| * Decimal256 |
| * DateTime with various units |
| * Durations with various units |
| * Intervals |
| - MonthDayNano interval is a separate case |
| * Map Types |
| - Non-Canonical Maps |
| * Nested Types |
| - Lists |
| - Structs |
| - Lists with Large Offsets |
| * Unions |
| * Custom Metadata |
| * Schemas with Duplicate Field Names |
| * Dictionary Types |
| - Signed indices |
| - Unsigned indices |
| - Nested dictionaries |
| * Run end encoded |
| * Binary view and string view |
| * List view and large list view |
| * Extension Types |
| |
| |
| Gold File Integration Tests |
| ~~~~~~~~~~~~~~~~~~~~~~~~~~~ |
| |
Pre-generated JSON and Arrow IPC files (both file and stream format) exist
| in the `arrow-testing <https://github.com/apache/arrow-testing>`__ repository |
| in the ``data/arrow-ipc-stream/integration`` directory. These serve as |
| *gold* files that are assumed to be correct for use in testing. They are |
| referenced by ``runner.py`` in the code for the :ref:`Archery <archery>` |
| utility. Below are the test cases which are covered by them: |
| |
| * Backwards Compatibility |
| |
| - The following cases are tested using the 0.14.1 format: |
| |
| + datetime |
| + decimals |
| + dictionaries |
| + intervals |
| + maps |
| + nested types (list, struct) |
| + primitives |
| + primitive with no batches |
| + primitive with zero length batches |
| |
- The following case is tested using the 0.17.1 format:
| |
| + unions |
| |
| * Endianness |
| |
- The following cases are tested with both little-endian and big-endian versions to exercise automatic endianness conversion:
| |
| + custom metadata |
| + datetime |
| + decimals |
| + decimal256 |
| + dictionaries |
| + dictionaries with unsigned indices |
| + record batches with duplicate fieldnames |
| + extension types |
| + interval types |
| + map types |
| + non-canonical map data |
| + nested types (lists, structs) |
| + nested dictionaries |
| + nested large offset types |
| + nulls |
| + primitive data |
| + large offset binary and strings |
| + primitives with no batches included |
| + primitive batches with zero length |
| + recursive nested types |
| + union types |
| |
| * Compression tests |
| |
| - LZ4 |
| - ZSTD |
| |
| * Batches with Shared Dictionaries |
| |
| Generating new Gold Files |
| ''''''''''''''''''''''''' |
| |
From time to time, it is desirable to add new gold files, for example when the
Columnar format or the IPC specification is updated. Archery provides a dedicated
option for this.
| |
It is recommended to generate gold files using a well-known version of an Arrow
implementation. For example, if a build of Arrow C++ exists in ``./build/release/``,
| one can generate new gold files in the ``/tmp/gold-files`` directory using the |
| following command: |
| |
| .. code-block:: shell |
| |
| export ARROW_CPP_EXE_PATH=./build/release/ |
| archery integration --with-cpp 1 --write-gold-files=/tmp/gold-files |