Vectorized execution engine is an experimental feature added to the current version of Doris. The goal is to replace the current Doris row-based SQL execution engine, fully release the computing power of modern CPUs, and break through the performance shortcomings of Doris in the SQL execution engine.
Its specific design, implementation and effects can be found in ISSUE 6238.
The current Doris SQL execution engine is based on the row-based memory format and is designed based on the traditional volcano model. There is a lot of unnecessary overhead when performing SQL operators and function operations:
The resulting series of overheads makes the current Doris execution engine inefficient and does not adapt to the architecture of modern CPU.
And as shown in the figure below (quoted from Column-Oriented Database Systems), the vectorized execution engine redesigns the SQL execution engine of the columnar storage system based on the characteristics of modern CPU and the execution characteristics of the volcano model:
This greatly improves the efficiency of the CPU when executing SQL and improves the performance of SQL queries.
Set the session variable enable_vectorized_engine
to true
, then FE will convert SQL operators and SQL expressions into vectorized execution plans by default when performing query planning.
set enable_vectorized_engine = true;
batch_size
represents the number of rows that the SQL operator performs batch calculations on each time. The default configuration of Doris is 1024
. The number of lines in this configuration will affect the performance of the vectorized execution engine and the behavior of CPU cache prefetching. The recommended configuration here is 4096
.
set batch_size = 4096;
Performance degradation due to NULL value in the vectorized execution engine. Therefore, when creating a table, setting the corresponding column to NULL usually affects the performance of the vectorized execution engine. It is recommended to use some special column values to represent NULL values, and set the columns to NOT NULL when creating the table to give full play to the performance of the vectorized execution engine.
You can use the explain
command to check whether the current SQL has the vectorized execution engine enabled:
+-----------------------------+ | Explain String | +-----------------------------+ | PLAN FRAGMENT 0 | | OUTPUT EXPRS:<slot 0> TRUE | | PARTITION: UNPARTITIONED | | | | VRESULT SINK | | | | 0:VUNION | | constant exprs: | | TRUE | +-----------------------------+
After the vectorized execution engine is enabled, V
mark will be added before the SQL operator in the SQL execution plan.
In most scenarios, users only need to turn on the session variable by default to transparently enable the vectorized execution engine and improve the performance of SQL execution. However, the current vectorized execution engine is different from the original row-stored execution engine in the following minor details, which requires users to know. This part of the difference is divided into two categories
microseconds_add
and other functions that calculate milliseconds are not supported.0
and -0
are considered equal in SQL execution. This may affect the results of calculations like distinct
, group by
, etc.UDF
and UDAF
of the original row storage execution engine are not supported.string/text
type is 1MB instead of the default 2GB. That is, when the vectorization engine is turned on, it is impossible to query or import strings larger than 1MB. However, if you turn off the vectorization engine, you can still query and import normally.select ... into outfile
is not supported.