| .. Licensed to the Apache Software Foundation (ASF) under one |
| .. or more contributor license agreements. See the NOTICE file |
| .. distributed with this work for additional information |
| .. regarding copyright ownership. The ASF licenses this file |
| .. to you under the Apache License, Version 2.0 (the |
| .. "License"); you may not use this file except in compliance |
| .. with the License. You may obtain a copy of the License at |
| |
| .. http://www.apache.org/licenses/LICENSE-2.0 |
| |
| .. Unless required by applicable law or agreed to in writing, |
| .. software distributed under the License is distributed on an |
| .. "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY |
| .. KIND, either express or implied. See the License for the |
| .. specific language governing permissions and limitations |
| .. under the License. |
| |
| .. default-domain:: cpp |
| .. highlight:: cpp |
| .. cpp:namespace:: arrow::acero |
| |
| ============== |
| Acero Overview |
| ============== |
| |
| This page gives an overview of the basic Acero concepts and helps distinguish Acero |
| from other modules in the Arrow code base. It's intended for users, developers, |
| potential contributors, and for those that would like to extend Acero, either for |
| research or for business use. This page assumes the reader is already familiar with |
| core Arrow concepts. This page does not expect any existing knowledge in relational |
| algebra. |
| |
| What is Acero? |
| ============== |
| |
| Acero is a C++ library that can be used to analyze large (potentially infinite) streams |
| of data. Acero allows computation to be expressed as an "execution plan" (:class:`ExecPlan`). |
| An execution plan takes in zero or more streams of input data and emits a single |
| stream of output data. The plan describes how the data will be transformed as it |
| passes through. For example, a plan might: |
| |
| * Merge two streams of data using a common column |
| |
| * Create additional columns by evaluating expressions against the existing columns |
| |
| * Consume a stream of data by writing it to disk in a partitioned layout |
| |
| .. image:: simple_graph.svg |
| :alt: A sample execution plan that joins three streams of data and writes to disk |
| :align: center |
| |
| Acero is not... |
| --------------- |
| |
| A Library for Data Scientists |
| ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ |
| |
| Acero is not intended to be used directly by data scientists. It is expected that |
| end users will typically be using some kind of frontend. For example, Pandas, Ibis, |
| or SQL. The API for Acero is focused around capabilities and available algorithms. |
| However, such users may be interested in knowing more about how Acero works so that |
| they can better understand how the backend processing for their libraries operates. |
| |
| A Database |
| ^^^^^^^^^^ |
| |
| A database (or DBMS) is typically a much more expansive application and often packaged |
| as a standalone service. Acero could be a component in a database (most databases have |
| some kind of execution engine) or could be a component in some other data processing |
| application that hardly resembles a database. Acero does not concern itself with |
| user management, external communication, isolation, durability, or consistency. In |
| addition, Acero is focused primarily on the read path, and the write utilities lack |
| any sort of transaction support. |
| |
| An Optimizer |
| ^^^^^^^^^^^^ |
| |
| Acero does not have an SQL parser. It does not have a query planner. It does not have |
| any sort of optimizer. Acero expects to be given very detailed and low-level instructions |
| on how to manipulate data and then it will perform that manipulation exactly as described. |
| |
| Creating the best execution plan is very hard. Small details can have a big impact on |
| performance. We do think an optimizer is important but we believe it should be |
| implemented independent of acero, hopefully in a composable way through standards such |
| as Substrait so that any backend could leverage it. |
| |
| Distributed |
| ^^^^^^^^^^^ |
| |
| Acero does not provide distributed execution. However, Acero aims to be usable by a distributed |
| query execution engine. In other words, Acero will not configure and coordinate workers but |
| it does expect to be used as a worker. Sometimes, the distinction is a bit fuzzy. For example, |
| an Acero source may be a smart storage device that is capable of performing filtering or other |
| advanced analytics. One might consider this a distributed plan. The key distinction is Acero |
| does not have the capability of transforming a logical plan into a distributed execution plan. |
| That step will need to be done elsewhere. |
| |
| Acero vs... |
| ----------- |
| |
| Arrow Compute |
| ^^^^^^^^^^^^^ |
| |
| This is described in more detail in :ref:`cpp-vs-acero` but the key difference |
| is that Acero handles streams of data and Arrow Compute handles situations where all the |
| data is in memory. |
| |
| Arrow Datasets |
| ^^^^^^^^^^^^^^ |
| |
| The Arrow datasets library provides some basic routines for discovering, scanning, and |
| writing collections of files. The datasets module depends on Acero. Both scanning and |
| writing datasets uses Acero. The scan node and the write node are part of the datasets |
| module. This helps to keep the complexity of file formats and filesystems out of the core |
| Acero logic. |
| |
| Substrait |
| ^^^^^^^^^ |
| |
| Substrait is a project establishing standards for query plans. Acero executes query plans |
| and generates data. This makes Acero a Substrait consumer. There are more details on the |
| Substrait capabilities in :ref:`acero-substrait`. |
| |
| Datafusion / DuckDb / Velox / Etc. |
| ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ |
| |
| There are many columnar data engines emerging. We view this as a good thing and encourage |
| projects like Substrait to help allow switching between engines as needed. We generally |
| discourage comparative benchmarks as they are almost inevitably going to be workload-driven |
| and rarely manage to capture an apples-vs-apples comparison. Discussions of the pros and |
| cons of each is beyond the scope of this guide. |
| |
| .. _cpp-vs-acero: |
| |
| Relation to Arrow C++ |
| ===================== |
| |
| The Acero module is part of the Arrow C++ implementation. It is built as a separate |
| module but it depends on core Arrow modules and does not stand alone. Acero uses |
| and extends the capabilities from the core Arrow module and the Arrow compute kernels. |
| |
| .. image:: layers.svg |
| :alt: A diagram of layers with core on the left, compute in the middle, and acero on the right |
| |
| The core Arrow library provides containers for buffers and arrays that are laid out according |
| to the Arrow columnar format. With few exceptions the core Arrow library does not examine |
| or modify the contents of buffers. For example, converting a string array from lowercase |
| strings to uppercase strings would not be a part of the core Arrow library because that would |
| require examining the contents of the array. |
| |
| The compute module expands on the core library and provides functions which analyze and |
| transform data. The compute module's capabilities are all exposed via a function registry. |
| An Arrow "function" accepts zero or more arrays, batches, or tables, and produces an array, |
| batch, or table. In addition, function calls can be combined, along with field references |
| and literals, to form an expression (a tree of function calls) which the compute module can |
| evaluate. For example, calculating ``x + (y * 3)`` given a table with columns ``x`` and ``y``. |
| |
| .. image:: expression_ast.svg |
| :align: center |
| :alt: A sample expression tree |
| |
| Acero expands on these capabilities by adding compute operations for streams of data. For |
| example, a project node can apply a compute expression on a stream of batches. This will |
| create a new stream of batches with the result of the expression added as a new column. These |
| nodes can be combined into a graph to form a more complex execution plan. This is very similar |
| to the way functions are combined into a tree to form a complex expression. |
| |
| .. image:: simple_plan.svg |
| :alt: A simple plan that uses compute expressions |
| |
| .. note:: |
| Acero does not use the :class:`arrow::Table` or :class:`arrow::ChunkedArray` containers |
| from the core Arrow library. This is because Acero operates on streams of batches and |
| so there is no need for a multi-batch container of data. This helps to reduce the |
| complexity of Acero and avoids tricky situations that can arise from tables whose |
| columns have different chunk sizes. Acero will often use :class:`arrow::Datum` |
| which is a variant from the core module that can hold many different types. Within |
| Acero, a datum will always hold either an :class:`arrow::Array` or a :class:`arrow::Scalar`. |
| |
| Core Concepts |
| ============= |
| |
| ExecNode |
| -------- |
| |
| The most basic concept in Acero is the ExecNode. An ExecNode has zero or more inputs and |
| zero or one outputs. If an ExecNode has zero inputs we call it a source and if an ExecNode |
| does not have an output then we call it a sink. There are many different kinds of nodes and |
| each one transforms its inputs in different ways. For example: |
| |
| * A scan node is a source node that reads data from files |
| * An aggregate node accumulates batches of data to compute summary statistics |
| * A filter node removes rows from the data according to a filter expression |
| * A table sink node accumulates data into a table |
| |
| .. note:: |
| A full list of the available compute modules is included in the :ref:`user's guide<ExecNode List>` |
| |
| .. _exec-batch: |
| |
| ExecBatch |
| --------- |
| |
| Batches of data are represented by the ExecBatch class. An ExecBatch is a 2D structure that |
| is very similar to a RecordBatch. It can have zero or more columns and all of the columns |
| must have the same length. There are a few key differences from ExecBatch: |
| |
| .. figure:: rb_vs_eb.svg |
| |
| Both the record batch and the exec batch have strong ownership of the arrays & buffers |
| |
| * An ``ExecBatch`` does not have a schema. This is because an ``ExecBatch`` is assumed to be |
| part of a stream of batches and the stream is assumed to have a consistent schema. So |
| the schema for an ``ExecBatch`` is typically stored in the ExecNode. |
| * Columns in an ``ExecBatch`` are either an ``Array`` or a ``Scalar``. When a column is a ``Scalar`` |
| this means that the column has a single value for every row in the batch. An ``ExecBatch`` |
| also has a length property which describes how many rows are in a batch. So another way to |
| view a ``Scalar`` is a constant array with ``length`` elements. |
| * An ``ExecBatch`` contains additional information used by the exec plan. For example, an |
| ``index`` can be used to describe a batch's position in an ordered stream. We expect |
| that ``ExecBatch`` will also evolve to contain additional fields such as a selection vector. |
| |
| .. figure:: scalar_vs_array.svg |
| |
| There are four different ways to represent the given batch of data using different combinations |
| of arrays and scalars. All four exec batches should be considered semantically equivalent. |
| |
| Converting from a record batch to an exec batch is always zero copy. Both RecordBatch and ExecBatch |
| refer to the exact same underlying arrays. Converting from an exec batch to a record batch is |
| only zero copy if there are no scalars in the exec batch. |
| |
| .. note:: |
| Both Acero and the compute module have "lightweight" versions of batches and arrays. |
| In the compute module these are called ``BatchSpan``, ``ArraySpan``, and ``BufferSpan``. In |
| Acero the concept is called ``KeyColumnArray``. These types were developed concurrently |
| and serve the same purpose. They aim to provide an array container that can be completely |
| stack allocated (provided the data type is non-nested) in order to avoid heap allocation |
| overhead. Ideally these two concepts will be merged someday. |
| |
| ExecPlan |
| -------- |
| |
| An ExecPlan represents a graph of ExecNode objects. A valid ExecPlan must always have at |
| least one source node but it does not technically need to have a sink node. The ExecPlan contains |
| resources shared by all of the nodes and has utility functions to control starting and stopping |
| execution of the nodes. Both ExecPlan and ExecNode are tied to the lifecycle of a single execution. |
| They have state and are not expected to be restartable. |
| |
| .. warning:: |
| The structures within Acero, including ``ExecBatch``, are still experimental. The ``ExecBatch`` |
| class should not be used outside of Acero. Instead, an ``ExecBatch`` should be converted to |
| a more standard structure such as a ``RecordBatch``. |
| |
| Similarly, an ExecPlan is an internal concept. Users creating plans should be using Declaration |
| objects. APIs for consuming and executing plans should abstract away the details of the underlying |
| plan and not expose the object itself. |
| |
| Declaration |
| ----------- |
| |
| A Declaration is a blueprint for an ExecNode. Declarations can be combined into a graph to |
| form the blueprint for an ExecPlan. A Declaration describes the computation that needs to be |
| done but is not actually responsible for carrying out the computation. In this way, a Declaration is |
| analogous to an expression. It is expected that Declarations will need to be converted to and from |
| various query representations (e.g. Substrait). The Declaration objects are the public API, combined |
| with the DeclarationToXyz methods, are the current public API for Acero. |
| |
| .. figure:: decl_vs_ep.svg |
| |
| A declaration is a blueprint that is used to instantiate exec plan instances |