This article introduces the differences, advantages, and applicable scenarios of the compute-storage coupled mode and compute-storage decoupled mode of Doris, providing a reference for users' selection.
The following sections will describe in detail how to deploy and use Apache Doris in the compute-storage decoupled mode. For information on deployment in compute-storage coupled mode, please refer to the Cluster Deployment section.
The overall architecture of Doris consists of two types of processes: Frontend (FE) and Backend (BE). The FE is primarily responsible for user request access, query parsing and planning, metadata management, and node management. The BE is responsible for data storage and query plan execution. (More information)
In the compute-storage coupled mode, the BE nodes perform both data storage and computation, and multiple BE nodes forms a massively parallel processing (MPP) distributed computing architecture.
The BE nodes no longer store the primary data. Instead, the shared storage layer serves as the unified primary data storage. Additionally, to overcome the performance loss caused by the limitations of the underlying object storage system and the overhead of network transmission, Doris introduces a high-speed cache on the local compute nodes.
Meta data layer:
The FE stores metadata, job information, permissions, and other MySQL protocol-dependent data.
Meta Service is the Doris metadata service in the compute-storage decoupling mode. It is responsible for data import transaction processing, tablet meta, rowset meta, and cluster resource management. It is a stateless service that can scale horizontally.
Computation layer:
In the compute-storage decoupled mode, the BE nodes are stateless. They cache a portion of the tablet metadata and data to improve query performance.
A compute cluster is a collection of stateless BE nodes serving as the computing resources. Multiple compute clusters share a single set of data, and the compute clusters can be elastically scaled by adding or removing nodes as needed.
:::info
The concept of compute cluster in the compute-storage decoupled mode is distinct from the “cluster” discussed in the [Cluster Deployment] and [Create Cluster] sections.
In the context of the compute-storage decoupled mode, the “Compute Cluster” specifically refers to the collection of stateless BE nodes that serve as the computing resources, rather than the complete distributed system consisting of multiple Apache Doris nodes as described in the [Cluster Deployment] and [Create Cluster] sections.
:::
Shared storage layer:
The shared storage layer stores the data files, including segment files and the inverted index files.
As mentioned earlier, a compute cluster is formed by one or more stateless BE nodes. By using the compute cluster specification statement (use @<compute_group_name>), you can direct specific workloads to specific compute clusters, thus realizing physical isolation of data import and query workloads.
Assuming there are 2 compute clusters: C1 and C2.
Read isolation: Before initiating two large queries, you can leverage use @c1 and use @c2 respectively to make the two queries run on different compute nodes. In this way, the two queries will not interfere with each other due to competition for CPU and memory resources when accessing the same dataset.
Read-write isolation: Data imports can consume resources, especially with large data volumes and high import frequency. To avoid resource contention between queries and imports, you can specify query requests to run on C1 and import requests to run on C2 using use @c1 and use @c2. Meanwhile, the c1 compute cluster can access the newly imported data in the c2 compute cluster.
Write-write isolation: Data import tasks can also be isolated from each other. In some cases, the system handles both high-frequency small imports and large-scale batch imports. The batch imports often take longer and have higher retry costs, while the high-frequency small imports are the opposite. To avoid small imports interfering with batch imports, you can direct the small imports to run on c1 and the batch imports to run on c2 via use @c1 and use @c2.